EMPIRICAL RISK MINIMIZATION IN INVERSE PROBLEMS^1

By Jussi Klemela and Enno Mammen 

University of Oulu and University of Mannheim 

We study estimation of a multivariate function $f : \mathbf{R}^d \to \mathbf{R}$ when
the observations are available from the function $Af$, where $A$ is a
known linear operator. Both the Gaussian white noise model and
density estimation are studied. We define an $L_2$-empirical risk functional
which is used to define a $\delta$-net minimizer and a dense empirical
risk minimizer. Upper bounds for the mean integrated squared error
of the estimators are given. The upper bounds show how the difficulty
of the estimation depends on the operator through the norm
of the adjoint of the inverse of the operator and on the underlying
function class through the entropy of the class. Corresponding lower
bounds are also derived. As examples, we consider convolution operators
and the Radon transform. In these examples, the estimators
achieve the optimal rates of convergence. Furthermore, a new type of
oracle inequality is given for inverse problems in additive models.

1. Introduction. We consider estimation of a function $f : \mathbf{R}^d \to \mathbf{R}$ when
a linear transform $Af$ of the function is observed under stochastic noise.
We consider both the Gaussian white noise model and density estimation
with i.i.d. observations. We study two estimators: a $\delta$-net estimator, which
minimizes the $L_2$-empirical risk over a minimal $\delta$-net of a function class, and
a dense empirical risk minimizer, which minimizes the empirical risk over
the whole function class without restricting the minimization to a $\delta$-net.
We call this estimator a "dense minimizer" because it is defined as a minimizer
over a possibly uncountable function class. The $\delta$-net estimator is more
universal: it may also be applied for nonsmooth functions and for severely
ill-posed operators. On the other hand, the dense empirical minimizer is
expected to work only for relatively smooth cases (the entropy integral has to
converge). However, because the minimization in the calculation of this
estimator is not restricted to a $\delta$-net, we have available a larger toolbox of
algorithms for finding (an approximation of) the minimizer of the empirical risk.

Received June 2008; revised June 2009.

^1 Supported by Deutsche Forschungsgemeinschaft under Project MA1026/8-1.
AMS 2000 subject classification. 62G07.

Key words and phrases. Deconvolution, empirical risk minimization, multivariate
density estimation, nonparametric function estimation, Radon transform, tomography.

This is an electronic reprint of the original article published by the
Institute of Mathematical Statistics in The Annals of Statistics,
2010, Vol. 38, No. 1, 482-511. This reprint differs from the original in pagination
and typographic detail.

Let $(\mathbf{Y}, \mathcal{Y}, \nu)$ be a Borel space and let $A : L_2(\mathbf{R}^d) \to L_2(\mathbf{Y})$ be a linear
operator, where $L_2(\mathbf{R}^d)$ is the space of square integrable functions $f : \mathbf{R}^d \to \mathbf{R}$
(with respect to the Lebesgue measure) and $L_2(\mathbf{Y})$ is the space of square
integrable functions $g : \mathbf{Y} \to \mathbf{R}$ (with respect to measure $\nu$). In the density
estimation model, we have i.i.d. observations

(1) $Y_1, \ldots, Y_n \in \mathbf{Y}$

with common density function $Af : \mathbf{Y} \to \mathbf{R}$, where $f : \mathbf{R}^d \to \mathbf{R}$ is a density
function which we want to estimate. In the Gaussian white noise model, the
observation is a realization of the process

(2) $dY_n(y) = (Af)(y)\,dy + n^{-1/2}\,dW(y), \quad y \in \mathbf{Y},$

where $W(y)$ is the Brownian process on $\mathbf{Y}$, that is, for $h_1, h_2 \in L_2(\mathbf{Y})$,
the random vector $(\int_{\mathbf{Y}} h_1\,dW, \int_{\mathbf{Y}} h_2\,dW)$ is a two-dimensional Gaussian random
vector with zero mean, marginal variances $\|h_1\|_2^2$, $\|h_2\|_2^2$ and covariance
$\int_{\mathbf{Y}} h_1 h_2\,d\nu$. (In our examples, $\mathbf{Y}$ is either the Euclidean space or the product
of the real half-line with the unit sphere so that the existence of the
Brownian process is guaranteed.) We want to estimate the signal function
$f : \mathbf{R}^d \to \mathbf{R}$. The Gaussian white noise model is very useful for presenting
the basic mathematical ideas in a transparent way. For the $\delta$-net estimator,
the treatment is almost identical for the Gaussian white noise model
and for the density estimation, but when we consider the dense empirical
risk minimization, then, in the density estimation model, we need to use
bracketing numbers and empirical entropies with bracketing, instead of the
usual $L_2$-entropies. Our results for the Gaussian white noise model can also
serve as a first step for obtaining analogous results for inverse problems in
regression or in other statistical models.
The $L_2$-empirical risk is defined by

(3) $\gamma_n(g) = \begin{cases} -2\int_{\mathbf{Y}} (Qg)\,dY_n + \|g\|_2^2, & \text{Gaussian white noise},\\ -2n^{-1}\sum_{i=1}^n (Qg)(Y_i) + \|g\|_2^2, & \text{density estimation}, \end{cases}$

where $Q$ is the adjoint of the inverse of $A$:

(4) $\int_{\mathbf{R}^d} (A^{-1}h)\,g = \int_{\mathbf{Y}} h\,(Qg)\,d\nu$

for $h \in L_2(\mathbf{Y})$, $g \in L_2(\mathbf{R}^d)$. The operator $Q = (A^{-1})^*$ has the domain $L_2(\mathbf{R}^d)$,
as does $A$. Minimizing $\|\hat f - f\|_2^2$ with respect to estimators $\hat f$ is equivalent
to minimizing $\|\hat f - f\|_2^2 - \|f\|_2^2$ and we have, in the Gaussian white noise
model,

(5) $\|\hat f - f\|_2^2 - \|f\|_2^2 = -2\int_{\mathbf{R}^d} f\hat f + \|\hat f\|_2^2 = -2\int_{\mathbf{Y}} (Af)(Q\hat f)\,d\nu + \|\hat f\|_2^2 \approx -2\int_{\mathbf{Y}} (Q\hat f)\,dY_n + \|\hat f\|_2^2 = \gamma_n(\hat f).$

The usual least squares estimator is defined as a minimizer of the criterion

(6) $\|Ag - Af\|_2^2 - \|Af\|_2^2 \approx -2\int_{\mathbf{Y}} (Ag)\,dY_n + \|Ag\|_2^2;$

see, for example, O'Sullivan (1986). In density estimation, the log-likelihood
empirical risk has been more common than the $L_2$-empirical risk and in
the setting of inverse problems, the log-likelihood is defined as
$\gamma_n(g) = -n^{-1}\sum_{i=1}^n \log(Ag)(Y_i)$, analogously to (6). These alternative definitions
of the empirical risk do not seem to lead to such an elegant theory as does
the empirical risk in (3). The empirical risk in (3) has been used in deconvolution
problems for projection estimators by Comte, Taupin and Rozenholc
(2006).

We give upper bounds for the mean integrated squared error (MISE) of
the estimators. The upper bounds characterize how the rates of convergence
depend on the entropy of the underlying function class $\mathcal{F}$ and on smoothness
properties of the operator $A$. Previously, such characterizations have
been given (to our knowledge) in inverse problems only for the case
of estimating real-valued linear functionals $L$. In these cases, the rates of
convergence are determined by the modulus of continuity of the functional
$\omega(\epsilon) = \sup\{L(f) : f \in \mathcal{F}, \|Af\|_2 \le \epsilon\}$; see Donoho and Low (1992). For the
case of estimating the whole function with a global loss function, the rates of
convergence depend on the size of the underlying function class in terms of
the entropy and capacity; see Cencov (1972), Le Cam (1973), Ibragimov and
Hasminskii (1980, 1981), Birgé (1983), Hasminskii and Ibragimov (1990),
Yang and Barron (1999), Ibragimov (2004). $\delta$-net estimators were considered
by, for example, van der Laan, Dudoit and van der Vaart (2004). These
papers consider direct statistical problems. We show that for inverse statistical
problems, the rate of convergence depends on the operator through
the operator norm $\varrho(Q, \mathcal{F}_\delta)$ of $Q$ over a minimal $\delta$-net $\mathcal{F}_\delta$; see (8) for the
definition of $\varrho(Q, \mathcal{F}_\delta)$. More precisely, the convergence rate $\psi_n$ of the $\delta$-net
estimator is the solution to the equation

$n\psi_n^2 = \varrho^2(Q, \mathcal{F}_{\psi_n}) \log(\#\mathcal{F}_{\psi_n}),$

where $\#\mathcal{F}_{\psi_n}$ is the cardinality of a minimal $\delta$-net. For direct problems, when
$A$ is the identity operator, $\varrho(Q, \mathcal{F}_\delta) = 1$. (We write $a_n \asymp b_n$ to mean that
$0 < \liminf_{n\to\infty} a_n/b_n \le \limsup_{n\to\infty} a_n/b_n < \infty$.) As examples of operators
$A$, we consider the convolution operator and the Radon transform. For these
operators, the estimators achieve the minimax rates of convergence over
Sobolev classes.

The general framework for empirical risk minimization and the use of the
empirical process machinery, including entropy bounds, for deriving optimal
bounds seem to be new. Convolution and Radon transforms are discussed
for illustrative purposes. These examples show that our results lead to opti- 
mal rates of convergence. As a new application, we introduce the estimation 
of additive models in inverse problems. A new type of oracle inequality is 
presented, which also gives the optimal rates of convergence in "anisotropic" 
inverse problems. For an extended version of this paper that also contains 
additional material, see Klemela and Mammen (2009). 

The paper is organized as follows. Section 2 gives an upper bound for
the MISE of the $\delta$-net estimator. Section 3 gives a lower bound for the
MISE of any estimator. Section 4 gives an upper bound for the MISE of the
dense empirical risk minimizer. Section 5 proves that the $\delta$-net estimator
achieves the optimal rate of convergence in the ellipsoidal framework and
discusses this result for the case where $A$ is a convolution operator or the
Radon transform. Furthermore, it contains an oracle inequality for additive
models. Section 6 contains the proofs of the main results.

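2. $\delta$-net minimizer. Let $\mathcal{F}$ be a set of densities or signal functions
$f : \mathbf{R}^d \to \mathbf{R}$. Let $\mathcal{F}_\delta$ be a finite $\delta$-net of $\mathcal{F}$ in the $L_2$-metric, where $\delta > 0$. That is, for
each $f \in \mathcal{F}$, there is a $\phi \in \mathcal{F}_\delta$ such that $\|f - \phi\|_2 \le \delta$. Define the estimator
$\hat f$ by

$\hat f = \operatorname{argmin}_{\phi \in \mathcal{F}_\delta} \gamma_n(\phi),$

where $\gamma_n(\phi)$ is defined in (3). Typically, we would like to choose a $\delta$-net of
minimal cardinality. We assume that $\mathcal{F}$ is bounded in the $L_2$-metric:

(7) $\sup_{g \in \mathcal{F}} \|g\|_2 \le B_2,$

where $0 < B_2 < \infty$.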
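
The definition of the $\delta$-net minimizer can be illustrated in a sequence-space version of the Gaussian white noise model. The following is only a sketch under assumptions that are not made at this point of the text: $A$ is taken to have a singular value decomposition $A\phi_j = \sigma_j\psi_j$, so that $Q\phi_j = \sigma_j^{-1}\psi_j$ and the observations reduce to $y_j = \sigma_j\theta_j + n^{-1/2}\varepsilon_j$. The function names, the three-element candidate set and all numerical values are hypothetical.

```python
import numpy as np

def empirical_risk(g, y, sigma):
    """L2-empirical risk in sequence space: -2 * sum_j g_j * y_j / sigma_j + sum_j g_j**2."""
    return -2.0 * np.sum(g * y / sigma) + np.sum(g ** 2)

def net_minimizer(net, y, sigma):
    """Return the element of a finite candidate set with the smallest empirical risk."""
    risks = [empirical_risk(g, y, sigma) for g in net]
    return net[int(np.argmin(risks))]

def simulate_observations(theta, sigma, n, rng):
    """Assumed observation model: y_j = sigma_j * theta_j + n**(-1/2) * eps_j."""
    return sigma * theta + rng.standard_normal(theta.shape) / np.sqrt(n)

rng = np.random.default_rng(0)
theta = np.array([1.0, 0.5, 0.25])      # hypothetical coefficients of the signal f
sigma = np.array([1.0, 0.5, 0.25])      # hypothetical singular values of A
net = [np.array([1.0, 0.5, 0.0]),       # hypothetical finite net of candidates
       np.array([0.0, 0.0, 0.0]),
       np.array([1.0, 0.5, 0.25])]

y = simulate_observations(theta, sigma, n=10_000, rng=rng)
f_hat = net_minimizer(net, y, sigma)
```

With noise-free data the criterion equals $\|g - \theta\|_2^2 - \|\theta\|_2^2$, so the minimizer is the net element closest to the signal; the division by $\sigma_j$ is where the ill-posedness of the operator enters the variance.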

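Theorem 1 gives a bound for the mean integrated squared error of the
estimator. We may identify the first term in the bound as a bias term and the
second term as a variance term. The variance term depends on the operator
norm of $Q$ over the $\delta$-net $\mathcal{F}_\delta$. We define this operator norm as

(8) $\varrho(Q, \mathcal{F}_\delta) = \max_{\phi, \phi' \in \mathcal{F}_\delta,\ \phi \ne \phi'} \frac{\|Q(\phi - \phi')\|_2}{\|\phi - \phi'\|_2},$

where $Q$ is defined by (4). In the case of density estimation, we need the
additional assumptions that $\varrho(Q, \mathcal{F}_\delta) \ge 1$ and that $A\mathcal{F}$ and $Q\mathcal{F}_\delta$ are bounded
in the $L_\infty$ metric:

(9) $\varrho(Q, \mathcal{F}_\delta) \ge 1, \quad \sup_{f \in \mathcal{F}} \|Af\|_\infty \le B_\infty, \quad \sup_{f \in \mathcal{F}_\delta} \|Qf\|_\infty \le B'_\infty,$

where $0 < B_\infty, B'_\infty < \infty$.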

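Theorem 1. For the density estimation, we assume that (9) is satisfied.
For $f \in \mathcal{F}$, we have that

$E\|\hat f - f\|_2^2 \le C_1\delta^2 + C_2\,\frac{\varrho^2(Q, \mathcal{F}_\delta)\,(\log_e(\#\mathcal{F}_\delta) + 1)}{n},$

where

(10) $C_1 = (1 - 2\xi)^{-1}(1 + 2\xi),$

(11) $C_2 = (1 - 2\xi)^{-1} C_\xi,$

(12) $C_\xi > 0$

and $\xi$ is such that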



C-i(4S^/3 + 72[8CB^)V9 + a^) < e < 1/2, 

(13) ^ density estimation, 
^IjC'T < ^ < 1/2, white noise. 

A proof of Theorem 1 is given in Section 6.2. 

Remark 1. Theorem 1 shows that the $\delta$-net estimator achieves the rate
of convergence $\psi_n$ when $\psi_n$ is the solution of the equation

(14) $\psi_n^2 \asymp n^{-1}\varrho^2(Q, \mathcal{F}_{\psi_n}) \log(\#\mathcal{F}_{\psi_n}).$

We calculate the rate under the assumptions that $\log(\#\mathcal{F}_\delta)$ and $\varrho(Q, \mathcal{F}_\delta)$
increase polynomially as $\delta$ decreases: we assume that one can find a $\delta$-net
whose cardinality satisfies

$\log(\#\mathcal{F}_\delta) = C\delta^{-b}$

for some constants $b, C > 0$ and we assume that

$\varrho(Q, \mathcal{F}_\delta) = C'\delta^{-a}$

for some $a \ge 0$, $C' > 0$ (in the direct case $a = 0$ and $C' = 1$). Then (14) can be
written as $\psi_n^2 \asymp n^{-1}\psi_n^{-2a-b}$ and the rate of the $\delta$-net estimator is

(15) $\psi_n \asymp n^{-1/[2(a+1)+b]}.$

Let $\mathcal{F}$ be a set of $s$-smooth, $d$-dimensional functions such that $b = d/s$. Then
the rate is $\psi_n \asymp n^{-s/[2(a+1)s+d]}$ which, for the direct case $a = 0$, gives the
classical rate $\psi_n \asymp n^{-s/(2s+d)}$.
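
The exponent arithmetic in Remark 1 can be checked with exact fractions. This is a small sketch, not part of the paper; the function names `rate_exponent` and `smooth_class_exponent` are hypothetical, and constants hidden in $\asymp$ are ignored.

```python
from fractions import Fraction

def rate_exponent(a, b):
    """Exponent r in psi_n ~ n**(-r), with r = 1 / (2 * (a + 1) + b) as in (15)."""
    return 1 / (2 * (Fraction(a) + 1) + Fraction(b))

def smooth_class_exponent(s, d, a=0):
    """The same exponent when b = d / s, written as s / (2 * (a + 1) * s + d)."""
    s, d, a = Fraction(s), Fraction(d), Fraction(a)
    return s / (2 * (a + 1) * s + d)

# Direct case (a = 0), smoothness s = 2, dimension d = 1: exponent s / (2s + d).
print(rate_exponent(0, Fraction(1, 2)))   # 2/5
print(smooth_class_exponent(2, 1))        # 2/5
```

Increasing $a$ (a more ill-posed operator) or $b$ (a richer function class) lowers the exponent and therefore slows the rate.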
3. A lower bound for MISE. Theorem 2 gives a lower bound for the
mean integrated squared error of any estimator when estimating densities
or signal functions $f : \mathbf{R}^d \to \mathbf{R}$ in the function class $\mathcal{F}$. Theorem 2 also holds
for nonlinear operators.

Theorem 2. Let $A$ be a possibly nonlinear operator. Assume that for
each sufficiently small $\delta > 0$, we find a finite set $\mathcal{D}_\delta \subset \mathcal{F}$ for which

(16) $\min\{\|f - g\|_2 : f, g \in \mathcal{D}_\delta,\ f \ne g\} \ge C_0\delta$

and

(17) $\max\{\|f - g\|_2 : f, g \in \mathcal{D}_\delta\} \le C_1\delta$, white noise,
     $\max\{D_K(f, g) : f, g \in \mathcal{D}_\delta\} \le C_1\delta$, density estimation,

where $D_K^2(f, g) = \int \log_e(f/g)\,f$ is the Kullback-Leibler distance and $C_0$, $C_1$
are positive constants. Let

$\varrho_K(A, \mathcal{D}_\delta) = \begin{cases} \dfrac{1}{\sqrt{2}} \max_{f, g \in \mathcal{D}_\delta,\ f \ne g} \dfrac{\|A(f - g)\|_2}{\|f - g\|_2}, & \text{white noise},\\[2ex] \max_{f, g \in \mathcal{D}_\delta,\ f \ne g} \dfrac{D_K(Af, Ag)}{\|f - g\|_2}, & \text{density estimation}. \end{cases}$

Let $\psi_n$ be such that

(18) $\log_e(\#\mathcal{D}_{\psi_n}) \gtrsim n\psi_n^2\varrho_K^2(A, \mathcal{D}_{\psi_n}),$

where $a_n \gtrsim b_n$ means that $\liminf_{n\to\infty} a_n/b_n > 0$. Assume that

(19) $\lim_{n\to\infty} n\psi_n^2\varrho_K^2(A, \mathcal{D}_{\psi_n}) = \infty.$

Then

$\liminf_{n\to\infty} \psi_n^{-2} \inf_{\hat f} \sup_{f \in \mathcal{F}} E\|\hat f - f\|_2^2 > 0,$

where the infimum is taken over all estimators. That is, $\psi_n$ is a lower bound
for the minimax rate of convergence.

A proof of Theorem 2 is given in Section 6.3.

Remark 2. Theorem 2 shows that one can get a lower bound $\psi_n$ for
the rate of convergence by solving the equation

(20) $\psi_n^2\varrho_K^2(A, \mathcal{D}_{\psi_n}) \asymp n^{-1}\log_e(\#\mathcal{D}_{\psi_n}).$

The upper bound in Theorem 1 depends on the operator norm of $Q$, defined
in (8), whereas the lower bound depends on the operator norm of $A$. Note,
also, that the operator norm $\varrho(Q, \mathcal{F}_{\psi_n})$ is on the other side of the equation
in (14) compared to the operator norm $\varrho_K(A, \mathcal{D}_{\psi_n})$ in the equation (20).






Remark 3. In the density estimation case, one can easily check assumptions
(17) and (19) if one assumes that the functions in $A\mathcal{D}_\delta$ are bounded
and bounded away from 0. Then

(21) $C \cdot \|A(f - g)\|_2 \le D_K(Af, Ag) \le C' \cdot \|A(f - g)\|_2$

and (17) and (19) follow by the corresponding conditions with Hilbert norms
instead of Kullback-Leibler distances.

4. Dense minimizer. The dense minimizer minimizes the empirical risk
over the whole function class $\mathcal{F}$. In contrast to the $\delta$-net estimator, the
minimization is not restricted to a $\delta$-net. We call this estimator a "dense
minimizer" because it is defined as a minimizer over a possibly uncountable
function class. The $\delta$-net estimator is more widely applicable: it may also be
applied to estimate nonsmooth functions and it may be applied when the
operator is severely ill-posed. The dense minimizer may only be applied for
relatively smooth cases (the entropy integral has to converge). Because it
works without a restriction to a $\delta$-net, we have available a larger toolbox of
numerical algorithms that can be applied.

For a collection $\mathcal{F}$ of functions $f : \mathbf{R}^d \to \mathbf{R}$, the dense minimizer $\hat f$ is
defined as a minimizer of the empirical risk over $\mathcal{F}$, up to $\epsilon_n > 0$:

$\gamma_n(\hat f) \le \inf_{g \in \mathcal{F}} \gamma_n(g) + \epsilon_n,$

where $\gamma_n$ is defined in (3). For clarity, we present separate theorems for
the Gaussian white noise model and for the density estimation model. In
both models, we make the assumption that the functions in $\mathcal{F}$ are bounded
in the $L_2$-metric as in (7).

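4.1. Gaussian white noise. Let $\mathcal{F}_\delta$, $\delta > 0$, be a $\delta$-net of $\mathcal{F}$, with respect
to the $L_2$-norm. Define

(22) $\varrho(Q, \mathcal{F}_\delta) = \max\left\{\frac{\|Q(f - g)\|_2}{\|f - g\|_2} : f \in \mathcal{F}_\delta,\ g \in \mathcal{F}_{2\delta},\ f \ne g\right\}, \quad \delta > 0,$

where $Q$ is the adjoint of the inverse of $A$, defined by (4). Define the entropy
integral

(23) $G(\delta) = \int_0^\delta \varrho(Q, \mathcal{F}_u)\sqrt{\log(\#\mathcal{F}_u)}\,du, \quad \delta \in (0, B_2],$

where $B_2$ is the $L_2$-bound defined by (7).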

Theorem 3. Assume that:
1. the entropy integral in (23) converges;
2. $G(\delta)/\delta^2$ is decreasing on the interval $(0, B_2]$;

3. $\varrho(Q, \mathcal{F}_\delta) = c\delta^{-a}$, where $0 \le a < 1$ and $c > 0$;

4. $\lim_{\delta\to 0} G(\delta)\delta^{a-1} = \infty$;

5. $\delta \mapsto \varrho(Q, \mathcal{F}_\delta)\sqrt{\log_e(\#\mathcal{F}_\delta)}$ is decreasing on $(0, B_2]$.
Let $\psi_n$ be such that

(24) $\psi_n^2 \ge Cn^{-1/2}G(\psi_n),$

where $C$ is a positive constant, and assume that $\lim_{n\to\infty} n\psi_n^{2(1+a)} = \infty$.
Then, for $f \in \mathcal{F}$,

$E\|\hat f - f\|_2^2 \le C'(\psi_n^2 + \epsilon_n)$

for a positive constant $C'$, for sufficiently large $n$.
A proof of Theorem 3 is given in Section 6.4.

Remark 4. Assumption 5 is a technical assumption which is used to 
replace a Riemann sum by an entropy integral. We prefer to write the as- 
sumptions in terms of the entropy integral in order to make them more 
readable. 

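Remark 5. We may write $\varrho(Q, \mathcal{F}_\delta)$ in a simpler way when there exist
minimal $\delta$-nets $\mathcal{F}_\delta$ which are nested: $\mathcal{F}_{2\delta} \subset \mathcal{F}_\delta$. We may then define,
alternatively,

$\varrho(Q, \mathcal{F}_\delta) = \max_{f, g \in \mathcal{F}_\delta,\ f \ne g} \frac{\|Q(f - g)\|_2}{\|f - g\|_2}.$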

Remark 6. Theorems 3 and 4 show that the rate of convergence of the
dense minimizer is the solution of the equation

(25) $\psi_n^2 = n^{-1/2}G(\psi_n).$

To get the optimal rate, the net $\mathcal{F}_\delta$ is chosen so that its cardinality is minimal.
In the polynomial case, one can find a $\delta$-net whose cardinality satisfies

$\log(\#\mathcal{F}_\delta) = C\delta^{-b}$

for some constants $b, C > 0$ and the operator norm satisfies

$\varrho(Q, \mathcal{F}_\delta) = C'\delta^{-a}$

for some $a \ge 0$, $C' > 0$. (In the direct case, $a = 0$ and $C' = 1$.) Thus, the entropy
integral $G(\delta)$ is finite when $\int_0^\delta u^{-a-b/2}\,du < \infty$, which holds when

(26) $a + b/2 < 1.$

Then (25) leads to $\psi_n^2 \asymp n^{-1/2}\psi_n^{-a-b/2+1}$ and the rate of the dense
minimization estimator is

(27) $\psi_n \asymp n^{-1/[2(a+1)+b]}.$

This is the same rate as the rate of the $\delta$-net estimator given in (15). We have
the following example. Let $\mathcal{F}$ be a set of $s$-smooth, $d$-dimensional functions
such that $b = d/s$. Condition (26) may then be written as a condition for the
smoothness index $s$: $s > d/[2(1 - a)]$. When the problem is direct, then $a = 0$
and we have the classical condition $s > d/2$. The rate is $\psi_n \asymp n^{-s/[2(a+1)s+d]}$,
which gives, for the direct case $a = 0$, the classical rate $\psi_n \asymp n^{-s/(2s+d)}$.
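
The step from (25) to (27) in Remark 6 can be written out explicitly. The following is a sketch under the polynomial assumptions of that remark, with multiplicative constants absorbed into $\asymp$ in the second line.

```latex
\begin{align*}
G(\delta) &= \int_0^\delta \varrho(Q,\mathcal{F}_u)\sqrt{\log(\#\mathcal{F}_u)}\,du
  = C' C^{1/2}\int_0^\delta u^{-a-b/2}\,du
  = \frac{C' C^{1/2}}{1-a-b/2}\,\delta^{\,1-a-b/2},
  \qquad a + b/2 < 1,\\
\psi_n^2 &\asymp n^{-1/2}\,\psi_n^{\,1-a-b/2}
  \;\Longleftrightarrow\; \psi_n^{\,1+a+b/2} \asymp n^{-1/2}
  \;\Longleftrightarrow\; \psi_n \asymp n^{-1/[2(a+1)+b]}.
\end{align*}
```

The integrability condition $a + b/2 < 1$ is exactly what makes the first integral finite, which is why the dense minimizer is restricted to smoother cases than the $\delta$-net estimator.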

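4.2. Density estimation. A $\delta$-bracketing net of $\mathcal{F}$ with respect to the
$L_2$-norm is a set $\mathcal{F}_\delta = \{(g_j^L, g_j^U) : j = 1, \ldots, N_\delta\}$ of pairs of functions such
that:

1. $\|g_j^L - g_j^U\|_2 \le \delta$, $j = 1, \ldots, N_\delta$;

2. for each $g \in \mathcal{F}$, there exists $j = j(g) \in \{1, \ldots, N_\delta\}$ such that $g_j^L \le g \le g_j^U$.

Let us define $\mathcal{F}_\delta^L = \{g_j^L : j = 1, \ldots, N_\delta\}$ and $\mathcal{F}_\delta^U = \{g_j^U : j = 1, \ldots, N_\delta\}$. Further,
define

(28) $\varrho_{den}(Q, \mathcal{F}_\delta) = \max\{\varrho(Q, \mathcal{F}_\delta^L, \mathcal{F}_\delta^U),\ \varrho(Q, \mathcal{F}_\delta^L, \mathcal{F}_{2\delta}^L)\},$
where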

,(Q, Ft F^) = maxj "^fji"^"^ : / G F^g"" G ] 

\^ \\Q g Wi ) 

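and

$\varrho(Q, \mathcal{F}_\delta^L, \mathcal{F}_{2\delta}^L) = \max\left\{\frac{\|Q(f - g)\|_2}{\|f - g\|_2} : f \in \mathcal{F}_\delta^L,\ g \in \mathcal{F}_{2\delta}^L,\ f \ne g\right\}$

for $\delta > 0$. Define the entropy integral

(29) $G(\delta) = \int_0^\delta \varrho_{den}(Q, \mathcal{F}_u)\sqrt{\log_e(\#\mathcal{F}_u)}\,du, \quad \delta \in (0, B_2],$

where $B_2 = \sup_{f \in \mathcal{F}} \|f\|_2$.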

Theorem 4. We make assumptions 1-5 of Theorem 3 [with operator
norm $\varrho_{den}(Q, \mathcal{F}_\delta)$ in place of $\varrho(Q, \mathcal{F}_\delta)$] and, in addition, we assume that
$\sup_{f \in \mathcal{F}} \|Af\|_\infty < \infty$, $\sup_{g \in \mathcal{F}_\delta^L \cup \mathcal{F}_\delta^U} \|Qg\|_\infty < \infty$ and that the operator $Q$
preserves positivity ($g \ge 0$ implies that $Qg \ge 0$). Let $\psi_n$ be such that

(30) $\psi_n^2 \ge Cn^{-1/2}G(\psi_n)$
for a positive constant $C$ and assume that $\lim_{n\to\infty} n\psi_n^{2(1+a)} = \infty$. Then, for
$f \in \mathcal{F}$,

$E\|\hat f - f\|_2^2 \le C'(\psi_n^2 + \epsilon_n)$
for a positive constant $C'$, for sufficiently large $n$.

A proof of Theorem 4 is given in Section 6.5. An analogous discussion 
of optimal rates as in Remark 6 for the Gaussian white noise model also 
applies for dense density estimators. 

5. Examples of function spaces. In Section 5.1, we consider ellipsoidal
function spaces and in Section 5.2 we consider additive models and their
generalizations.

5.1. Ellipsoidal function spaces. Since we are in the $L_2$-setting, it is natural
to work in the sequence space; we define the function classes as ellipsoids.
We shall apply singular value decompositions of the operators and
wavelet-vaguelette systems in the calculation of the rates of convergence. In Section
5.1.1, we calculate the operator norms in the framework of singular value
decompositions. In Section 5.1.2, we calculate the operator norms in the
wavelet-vaguelette framework. Section 5.1.3 derives the rate of convergence
of the $\delta$-net estimator for the case of a convolution operator and the Radon
transform, and the lower bound for the rate of convergence of any estimator.

5.1.1. Singular value decomposition. We assume that the underlying function
space $\mathcal{F}$ consists of $d$-variate functions that are linear combinations of
orthonormal basis functions $\phi_j$ with multi-index $j = (j_1, \ldots, j_d) \in \{0, 1, \ldots\}^d$.
Define the ellipsoid and the corresponding collection of functions by

(31) $\Theta = \left\{\theta : \sum_{j_1=0,\ldots,j_d=0}^{\infty} a_j^2\theta_j^2 \le L^2\right\}, \quad \mathcal{F} = \left\{\sum_{j_1=0,\ldots,j_d=0}^{\infty} \theta_j\phi_j : \theta \in \Theta\right\}.$

$\delta$-net and $\delta$-packing set for polynomial ellipsoids. We assume that there
exist positive constants $C_1$, $C_2$ such that for all $j \in \{0, 1, \ldots\}^d$,

(32) $C_1 \cdot |j|^s \le a_j \le C_2 \cdot |j|^s,$

where $|j| = j_1 + \cdots + j_d$. In Klemela and Mammen (2009), we construct
a $\delta$-net $\Theta_\delta$ and a $\delta$-packing set $\Theta_\delta^*$ using the techniques of Kolmogorov
and Tikhomirov (1961); see also Birman and Solomyak (1967). Since the
construction is in the sequence space, we define the $\delta$-net and $\delta$-packing set
of $\mathcal{F}$ by

(33) $\mathcal{F}_\delta = \left\{\sum_{j_1=0,\ldots,j_d=0}^{\infty} \theta_j\phi_j : \theta \in \Theta_\delta\right\}, \quad \mathcal{D}_\delta = \left\{\sum_{j_1=0,\ldots,j_d=0}^{\infty} \theta_j\phi_j : \theta \in \Theta_\delta^*\right\}.$

The set $\Theta_\delta$ is such that for $\theta \in \Theta_\delta$,

$\theta_j = 0$ when $j \notin \{1, \ldots, M\}^d$,

where

(34) $M \asymp \delta^{-1/s}.$
The set $\Theta_\delta^*$ is such that for all $\theta \in \Theta_\delta^*$,

(35) $\theta_j = \theta_j^*$ when $j \notin \{M^*, \ldots, M\}^d$,

where $\theta^*$ is a fixed sequence with $\sum_{j_1=0,\ldots,j_d=0}^{\infty} a_j^2\theta_j^{*2} = L^* < L$ and where
$M^* = [M/2]$. Furthermore, it holds that
(36) $\log(\#\Theta_\delta) \le C\delta^{-d/s}, \quad \log(\#\Theta_\delta^*) \ge C'\delta^{-d/s}.$

Operator norms. We calculate the operator norms $\varrho(Q, \mathcal{F}_\delta)$ and
$\varrho_K(A, \mathcal{D}_\delta)$ in the ellipsoidal framework, where $\mathcal{F}_\delta$ and $\mathcal{D}_\delta$ are defined in (33). We
apply the singular value decomposition of $A$. We assume that the domain
of $A$ is a separable Hilbert space $H$ with inner product $\langle\cdot,\cdot\rangle$. The underlying
function space $\mathcal{F}$ satisfies $\mathcal{F} \subset H$. We denote by $A^*$ the adjoint of $A$.
We assume that $A^*A$ is a compact operator on $H$ with eigenvalues $(b_j^2)$,
$b_j > 0$, $j \in \{0, 1, \ldots\}^d$, with an orthonormal system of eigenfunctions $\phi_j$. We
assume that there exist positive constants $q$ and $C_1$, $C_2$ such that for all
$j \in \{0, 1, \ldots\}^d$,

(37) $C_1 \cdot |j|^{-q} \le b_j \le C_2 \cdot |j|^{-q}.$
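Let $g, g'$ be elements of $\mathcal{F}_\delta$ or of $\mathcal{D}_\delta$, respectively. Write

$g - g' = \sum_{j_1=0,\ldots,j_d=0}^{M} (\theta_j - \theta_j')\phi_j.$

1. The functions $Q\phi_j$ are orthogonal and $\|Q\phi_j\|_2 = b_j^{-1}$. Indeed, $Q = (A^{-1})^*$
and thus

$\langle Q\phi_j, Q\phi_l\rangle = \langle\phi_j, A^{-1}(A^{-1})^*\phi_l\rangle = b_l^{-2}\langle\phi_j, \phi_l\rangle,$

where we have used the fact that

$A^{-1}(A^{-1})^* = A^{-1}(A^*)^{-1} = (A^*A)^{-1}.$
Thus, for $g, g' \in \mathcal{F}_\delta$,

(38) $\|Q(g - g')\|_2^2 = \sum_{j_1=0,\ldots,j_d=0}^{M} (\theta_j - \theta_j')^2 b_j^{-2} \le CM^{2q}\sum_{j_1=0,\ldots,j_d=0}^{M} (\theta_j - \theta_j')^2,$

where we have used (37) to infer that when $j \in \{0, \ldots, M\}^d$,
$b_j^{-2} \le C_1^{-2} \cdot |j|^{2q} \le C_1^{-2} \cdot (dM)^{2q}$.

On the other hand, $\|g - g'\|_2^2 = \sum_{j_1=0,\ldots,j_d=0}^{M} (\theta_j - \theta_j')^2$. This gives the
upper bound for the operator norm

(39) $\varrho(Q, \mathcal{F}_\delta) \le CM^q \le C'\delta^{-q/s}$

by the definition of $M$ in (34).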
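2. The functions $A\phi_j$ are orthogonal and $\|A\phi_j\|_2 = b_j$. Indeed,

$\langle A\phi_j, A\phi_l\rangle = \langle\phi_j, A^*A\phi_l\rangle = b_l^2\langle\phi_j, \phi_l\rangle.$

Thus, for $g, g' \in \mathcal{D}_\delta$,

(40) $\|A(g - g')\|_2^2 = \sum_{j_1=M^*,\ldots,j_d=M^*}^{M} (\theta_j - \theta_j')^2 b_j^2.$

This, together with calculations similar to those in (38), implies that

(41) $C'\delta^{q/s} \le \varrho_K(A, \mathcal{D}_\delta) \le C\delta^{q/s}.$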

5.1.2. Wavelet-vaguelette decomposition. We assume that the underlying
function space $\mathcal{F}$ consists of $d$-variate functions which are linear combinations
of orthonormal wavelet functions $(\phi_{jk})$, where $j \in \{0, 1, \ldots\}$ and
$k \in \{0, \ldots, 2^j - 1\}^d$. The $l_2$-body and the corresponding class of functions
can now be defined as

$\Theta = \left\{\theta : \sum_j \sum_k 2^{2sj}\theta_{jk}^2 \le L^2\right\}, \quad \mathcal{F} = \left\{\sum_j \sum_k \theta_{jk}\phi_{jk} : \theta \in \Theta\right\},$

where $s > 0$. We have already constructed a $\delta$-net and $\delta$-packing set for the
$l_2$-bodies in (33). Now, this is done such that for $\theta \in \Theta_\delta$,

$\theta_{jk} = 0$ when $j \ge J + 1$,

where

(42) $2^J \asymp \delta^{-1/s}$
and such that for $\theta \in \Theta_\delta^*$,

$\theta_{jk} = \theta_{jk}^*$ when $j \le J^*$ or $j \ge J + 1$,
where $\theta^*$ is a fixed sequence with $\sum_{j=0}^{\infty}\sum_k 2^{2sj}\theta_{jk}^{*2} = L^* < L$, and $J^* = J - 1$.
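Operator norms. We can apply the wavelet-vaguelette decomposition, as
defined in Donoho (1995), to calculate the operator norms $\varrho(Q, \mathcal{F}_\delta)$ and
$\varrho_K(A, \mathcal{D}_\delta)$. We have available the following three sets of functions: $(\phi_{jk})_{jk}$
is an orthogonal wavelet basis and $(u_{jk})_{jk}$ and $(v_{jk})_{jk}$ are near-orthogonal
sets:

$\left\|\sum_{jk} a_{jk}u_{jk}\right\|_2 \asymp \|(a_{jk})\|_{l_2}, \quad \left\|\sum_{jk} a_{jk}v_{jk}\right\|_2 \asymp \|(a_{jk})\|_{l_2},$

where $a \asymp b$ means that there exist positive constants $C$, $C'$ such that
$Cb \le a \le C'b$. The following quasi-singular relations hold:

$A\phi_{jk} = \kappa_j v_{jk}, \quad A^*u_{jk} = \kappa_j\phi_{jk},$

where $\kappa_j$ are quasi-singular values. We assume that there exist positive constants
$q$ and $C_1$, $C_2$ such that for all $j \in \{0, 1, \ldots\}$,

(43) $C_1 \cdot 2^{-qj} \le \kappa_j \le C_2 \cdot 2^{-qj}.$

1. Let $g, g' \in \mathcal{F}_\delta$. Write

$g - g' = \sum_{j=0}^{J}\sum_k (\theta_{jk} - \theta_{jk}')\phi_{jk}.$

Since $Q = (A^{-1})^*$, it holds that $QA^* = (AA^{-1})^* = I$. Thus,

$\langle Q\phi_{jk}, Q\phi_{j'k'}\rangle = \kappa_j^{-1}\kappa_{j'}^{-1}\langle QA^*u_{jk}, QA^*u_{j'k'}\rangle = \kappa_j^{-1}\kappa_{j'}^{-1}\langle u_{jk}, u_{j'k'}\rangle.$

This gives that

(44) $\|Q(g - g')\|_2^2 = \left\|\sum_{j=0}^{J}\sum_k (\theta_{jk} - \theta_{jk}')Q\phi_{jk}\right\|_2^2 \asymp \sum_{j=0}^{J} \kappa_j^{-2}\sum_k (\theta_{jk} - \theta_{jk}')^2 \le C2^{2qJ}\sum_{j=0}^{J}\sum_k (\theta_{jk} - \theta_{jk}')^2,$

where we have used (43) to infer that for $j \in \{0, \ldots, J\}$, it holds that

$\kappa_j^{-2} \le C_1^{-2} \cdot 2^{2qj} \le C_1^{-2} \cdot 2^{2qJ}.$

On the other hand, $\|g - g'\|_2^2 = \sum_{j=0}^{J}\sum_k (\theta_{jk} - \theta_{jk}')^2$. This gives the upper
bound for the operator norm

$\varrho(Q, \mathcal{F}_\delta) \le C2^{qJ} \le C'\delta^{-q/s}$

by the definition of $J$ in (42).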
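2. We have $\langle A\phi_{jk}, A\phi_{j'k'}\rangle = \kappa_j\kappa_{j'}\langle v_{jk}, v_{j'k'}\rangle$ and $(v_{jk})$ is a near-orthogonal
set. Thus, similarly as in (44), we get

$C'\delta^{q/s} \le \varrho_K(A, \mathcal{D}_\delta) \le C\delta^{q/s}.$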
5.1.3. Rates of convergence. We derive the rates of convergence for the
$\delta$-net estimator when the operator is a convolution operator or the Radon
transform. It is also shown that the lower bounds have the same order as
the upper bounds. We will do this for the Gaussian white noise model.

Convolution. Let $A$ be a convolution operator: $Af = a * f$, where $a : \mathbf{R}^d \to \mathbf{R}$
is a known function and where $a * f(x) = \int_{\mathbf{R}^d} a(x - y)f(y)\,dy$ is the convolution
of $a$ and $f$. For $j \in \{0, 1, \ldots\}^d$, $k \in K_j = \{k \in \{0, 1\}^d : k_i = 0 \text{ when } j_i = 0\}$,
denote

$\phi_{jk}(x) = \prod_{i=1}^{d} \sqrt{2}\,[(1 - k_i)\cos(2\pi j_i x_i) + k_i\sin(2\pi j_i x_i)], \quad x \in [0, 1]^d.$

The cardinality of $K_j$ is $2^{d-\alpha(j)}$, where $\alpha(j) = \#\{j_i : j_i = 0\}$. The collection
$(\phi_{jk})$, $(j, k) \in \{0, 1, \ldots\}^d \times K_j$, is a basis for 1-periodic functions on
$L_2([0, 1]^d)$. When the convolution kernel $a$ is a 1-periodic function in $L_2([0, 1]^d)$,
then we can write

$a(x) = \sum_{j_1=0,\ldots,j_d=0}^{\infty}\sum_{k \in K_j} b_{jk}\phi_{jk}(x).$

The functions $\phi_{jk}$ are the singular functions of the operator $A$ and the values
$b_{jk}$ are the corresponding singular values. We assume that the underlying
function space is equal to

(45) $\mathcal{F} = \left\{\sum_{j_1=0,\ldots,j_d=0}^{\infty}\sum_{k \in K_j} \theta_{jk}\phi_{jk}(x) : (\theta_{jk}) \in \Theta\right\},$
where

(46) $\Theta = \left\{\theta : \sum_{j_1=0,\ldots,j_d=0}^{\infty}\sum_{k \in K_j} a_{jk}^2\theta_{jk}^2 \le L^2\right\}.$
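
The index set $K_j$ and its stated cardinality $2^{d-\alpha(j)}$ can be checked by direct enumeration. The helper names `index_set_K` and `alpha` below are hypothetical and serve only as an illustration of the definition.

```python
from itertools import product

def index_set_K(j):
    """All k in {0, 1}^d with k_i = 0 whenever j_i = 0."""
    choices = [(0,) if j_i == 0 else (0, 1) for j_i in j]
    return list(product(*choices))

def alpha(j):
    """Number of zero coordinates of the multi-index j."""
    return sum(1 for j_i in j if j_i == 0)

j = (0, 3, 1)
K = index_set_K(j)
print(len(K), 2 ** (len(j) - alpha(j)))   # 4 4
```

The restriction $k_i = 0$ when $j_i = 0$ removes the sine factor at frequency zero, which would otherwise vanish identically.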

We give the rate of convergence of the $\delta$-net estimator and show that the
estimator achieves the optimal rate of convergence. Optimal rates of convergence
have previously been obtained for the convolution problem in various settings
in Ermakov (1989), Donoho and Low (1992), Koo (1993), Korostelev and
Tsybakov (1993).

Corollary 1. Let $\mathcal{F}$ be the function class as defined in (45). We assume
that the coefficients of the ellipsoid (46) satisfy

$C_0|j|^s \le a_{jk} \le C_1|j|^s$
for some $s > 0$ and $C_0, C_1 > 0$. We assume that the convolution filter $a$ is a
1-periodic function in $L_2([0, 1]^d)$ and that the Fourier coefficients of filter $a$
satisfy

$C_2|j|^{-q} \le b_{jk} \le C_3|j|^{-q}$
for some $q > 0$, $C_2, C_3 > 0$. Then

$\limsup_{n\to\infty} n^{2s/(2s+2q+d)} \sup_{f \in \mathcal{F}} E_f\|\hat f - f\|_2^2 < \infty,$

where $\hat f$ is the $\delta$-net estimator. Also,

$\liminf_{n\to\infty} n^{2s/(2s+2q+d)} \inf_{\hat g} \sup_{f \in \mathcal{F}} E_f\|\hat g - f\|_2^2 > 0,$

where the infimum is taken over all estimators $\hat g$.

Proof. For the upper bound, we apply Theorem 1. Let $\mathcal{F}_\delta$ be the $\delta$-net
of $\mathcal{F}$ as constructed in (33). We have shown in (39) that $\varrho(Q, \mathcal{F}_\delta) \le C\delta^{-a}$,
where $a = q/s$. We have stated in (36) that the cardinality of the $\delta$-net
satisfies $\log(\#\mathcal{F}_\delta) \le C\delta^{-b}$, where $b = d/s$. Thus, we may apply (15) to get
the rate $\psi_n = n^{-1/[2(a+1)+b]} = n^{-s/(2s+2q+d)}$. This shows the upper bound.

For the lower bound, we apply Theorem 2. Assumption (16) holds because
$\mathcal{D}_\delta$ in (33) is a $\delta$-packing set. Assumption (17) holds by the construction;
see Klemela and Mammen (2009). Assumptions (18) and (19) follow from
(36) and (41). Thus the lower bound is proved. □
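
The exponent used in the proof follows from (15) by substituting $a = q/s$ and $b = d/s$; as a short worked step:

```latex
\[
\frac{1}{2(a+1)+b}
  = \frac{1}{2\left(\frac{q}{s}+1\right)+\frac{d}{s}}
  = \frac{s}{2q+2s+d}
  = \frac{s}{2s+2q+d}.
\]
```

Squaring the rate gives the normalization $n^{2s/(2s+2q+d)}$ that appears in Corollary 1.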

Two-dimensional Radon transform. We consider reconstructing a two-dimensional
function $f$ from observations of its integrals over lines, that is,
from its Radon transform. We suppose that $f \in L_1(D) \cap L_2(D)$, where
$D = \{x \in \mathbf{R}^2 : \|x\| \le 1\}$ is the unit disk in $\mathbf{R}^2$. We parametrize the lines by the
length $u \in [0, 1]$ of the perpendicular from the origin to the line and by the
orientation $\phi \in [0, 2\pi)$ of this perpendicular. A common way to define the
two-dimensional Radon transform is

(47) $Af(u, \phi) = \frac{\pi}{2\sqrt{1 - u^2}}\int_{-\sqrt{1 - u^2}}^{\sqrt{1 - u^2}} f(u\cos\phi - t\sin\phi,\ u\sin\phi + t\cos\phi)\,dt,$

where $(u, \phi) \in \mathbf{Y} = [0, 1] \times [0, 2\pi)$. Now, the Radon transform is $\pi$ times
the average of $f$ over the line segment that intersects $D$. We consider $Af$
as the element of $L_2(\mathbf{Y}, \nu)$, where $\nu$ is the measure defined by
$d\nu(u, \phi) = 2\pi^{-1}\sqrt{1 - u^2}\,du\,d\phi$.

The singular value decomposition of the Radon transform can be found
in Deans (1983). Let

$\tilde\phi_{jk}(r, \theta) = \pi^{-1/2}(j + k + 1)^{1/2} Z_{j+k}^{|j-k|}(r)\,e^{i(j-k)\theta},$

$(r, \theta) \in D = [0, 1] \times [0, 2\pi),$
where $Z_a^b$ denotes the Zernike polynomial of degree $a$ and order $b$. Functions
$\tilde\phi_{jk}$, $j, k = 0, 1, \ldots$, $(j, k) \ne (0, 0)$, constitute an orthonormal complex-valued
basis for $L_2(D)$. The corresponding orthonormal functions in $L_2(\mathbf{Y}, \nu)$ are

$\tilde\psi_{jk}(u, \phi) = \pi^{-1/2}U_{j+k}(u)\,e^{i(j-k)\phi}, \quad (u, \phi) \in \mathbf{Y} = [0, 1] \times [0, 2\pi),$

where $U_m(\cos\theta) = \sin((m + 1)\theta)/\sin\theta$ are the Chebyshev polynomials of the
second kind. We have

$A\tilde\phi_{jk} = b_{jk}\tilde\psi_{jk},$

where the singular values are

(48) bjk = Tr-Hj + k + l)-'/^. 
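
The Chebyshev identity $U_m(\cos\theta) = \sin((m+1)\theta)/\sin\theta$ and the unit norm of the functions $\pi^{-1/2}U_m(u)e^{il\phi}$ can be checked numerically. The sketch below is an illustration only: it assumes the measure $d\nu(u,\phi) = 2\pi^{-1}\sqrt{1-u^2}\,du\,d\phi$ on $[0,1]\times[0,2\pi)$, evaluates $U_m$ by the standard three-term recurrence, and uses hypothetical function names.

```python
import math

def chebyshev_u(m, x):
    """Chebyshev polynomial of the second kind: U_0 = 1, U_1 = 2x, U_{m+1} = 2x U_m - U_{m-1}."""
    u_prev, u_curr = 1.0, 2.0 * x
    if m == 0:
        return u_prev
    for _ in range(m - 1):
        u_prev, u_curr = u_curr, 2.0 * x * u_curr - u_prev
    return u_curr

def psi_norm_squared(m, steps=2000):
    """Squared L2(Y, nu) norm of pi**(-1/2) * U_m(u) * exp(i*l*phi), by the midpoint rule.

    The phi-integral contributes 2*pi; the u-integral uses the substitution u = cos(t).
    """
    h = (math.pi / 2) / steps
    integral = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        integral += chebyshev_u(m, math.cos(t)) ** 2 * math.sin(t) ** 2 * h
    return 4.0 * integral / math.pi

theta = 0.7
for m in range(6):
    closed_form = math.sin((m + 1) * theta) / math.sin(theta)
    assert abs(chebyshev_u(m, math.cos(theta)) - closed_form) < 1e-12
```

Under the substitution $u = \cos t$ the integrand becomes $\sin^2((m+1)t)$, whose integral over $[0, \pi/2]$ is $\pi/4$, so the squared norm is $4\pi^{-1}\cdot\pi/4 = 1$.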

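The complex basis identifies the equivalent real orthonormal basis as follows:

$\phi_{jk} = \begin{cases} \sqrt{2}\,\mathrm{Re}(\tilde\phi_{jk}), & \text{if } j > k,\\ \tilde\phi_{jk}, & \text{if } j = k,\\ \sqrt{2}\,\mathrm{Im}(\tilde\phi_{jk}), & \text{if } j < k. \end{cases}$

We assume that the underlying function space is equal to

(49) $\mathcal{F} = \left\{\sum_{j_1=0,\,j_2=0,\,(j_1,j_2)\ne(0,0)}^{\infty} \theta_{j_1j_2}\phi_{j_1j_2}(x) : (\theta_{j_1j_2}) \in \Theta\right\},$

where

(50) $\Theta = \left\{\theta : \sum_{j_1=0,\,j_2=0,\,(j_1,j_2)\ne(0,0)}^{\infty} a_{j_1j_2}^2\theta_{j_1j_2}^2 \le L^2\right\}.$

We give the rate of convergence of the $\delta$-net estimator and show that the
estimator achieves the optimal rate of convergence. Optimal rates of convergence
have previously been obtained in Johnstone and Silverman (1990),
Korostelev and Tsybakov (1991), Donoho and Low (1992), Korostelev and
Tsybakov (1993).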

Corollary 2. Let $\mathcal{F}$ be the function class as defined in (49). We assume
that the coefficients of the ellipsoid (50) satisfy

$C_0|j|^s \le a_{jk} \le C_1|j|^s$
for some $s > 0$ and $C_0, C_1 > 0$. Then, for $d = 2$,

$\limsup_{n\to\infty} n^{2s/(2s+2d-1)} \sup_{f \in \mathcal{F}} E_f\|\hat f - f\|_2^2 < \infty,$

where $\hat f$ is the $\delta$-net estimator. Also,

$\liminf_{n\to\infty} n^{2s/(2s+2d-1)} \inf_{\hat g} \sup_{f \in \mathcal{F}} E_f\|\hat g - f\|_2^2 > 0,$

where the infimum is taken over all estimators $\hat g$.
Proof. For the upper bound, we apply Theorem 1. Let $\mathcal{F}_\delta$ be the $\delta$-net
of $\mathcal{F}$ as constructed in (33). We have shown in (39) that

$\varrho(Q, \mathcal{F}_\delta) \le C\delta^{-a},$

where $a = q/s$ and $q = 1/2$ [so that $a = (d - 1)/(2s)$] since the singular values
are given in (48). We have stated in (36) that the cardinality of the $\delta$-net
satisfies

$\log(\#\mathcal{F}_\delta) \le C\delta^{-b},$

where $b = d/s$. Thus, we can apply (15) to get the rate

$\psi_n = n^{-s/(2s+2d-1)}.$

The upper bound is proved. For the lower bound, we apply Theorem 2
similarly as in the proof of Corollary 1. □

5.2. Additive models. In this section, we will show that our approach can
be used to prove oracle results for additive models. In additive models, the
unknown function $f : \mathbf{R}^d \to \mathbf{R}$ is assumed to have an additive decomposition
$f(x) = f_1(x_1) + \cdots + f_d(x_d)$ with unknown additive components $f_j : \mathbf{R} \to \mathbf{R}$,
$j = 1, \ldots, d$. We compare this model with theoretical oracle models where
only one component function $f_r$ is unknown, the other functions $f_j$ ($j \ne r$)
being known. We will show below that the function $f$ can be estimated with
the same rate of convergence as in the oracle model that has the slowest rate
of convergence. In particular, if the rate of convergence is the same in all
oracle models, then the rate in the additive model remains the same. This is
a well-known fact for classical additive regression models; see, for example,
Stone (1985). It efficiently avoids the curse of dimensionality, in contrast
to the full-dimensional nonparametric model. Furthermore, it is practically
important because it allows a flexible and nicely interpretable model for
regression with high-dimensional covariates; see, for example, Hastie and
Tibshirani (1990) for a discussion of the additive and related models. Thus,
our result will generalize the oracle result for additive models of Stone (1985)
to inverse problems. For a theoretical discussion, we will first use a slightly
more general framework. We will later return to additive models.

5.2.1. Abstract setting. We assume that the function class $\mathcal{F}$ is a subset
of the direct sum of spaces $\mathcal{F}_1,\ldots,\mathcal{F}_p$. All spaces contain functions from
$\mathbb{R}^d \to \mathbb{R}$. At this stage, we do not assume that functions in $\mathcal{F}_j$ ($j = 1,\ldots,p$)
depend only on the argument $x_j$. Examples of this more general set-up are
sums of smooth functions and indicator functions of convex sets or of sets
with smooth boundary. We assume that a finite $\delta$-net $\mathcal{F}_\delta$ of $\mathcal{F}$ is a subset
of the direct sum $\mathcal{F}_{1,\delta} \oplus \cdots \oplus \mathcal{F}_{p,\delta}$, where $\mathcal{F}_{j,\delta}$ are finite subsets of $\mathcal{F}_j$. We
denote the number of elements of $\mathcal{F}_{j,\delta}$ by $\exp(\lambda_j)$. Furthermore, we write
$\rho_j = \varrho(Q,\mathcal{F}_{j,\delta})$. We make the following essential geometrical assumption:

$$\|f_1 + \cdots + f_p\|_2^2 \ge c\sum_{j=1}^p \|f_j\|_2^2 \tag{51}$$

for a positive constant $c > 0$. For the $\delta$-net minimizer $\hat f$ over the $\delta$-net $\mathcal{F}_\delta$,
we get the following result in the white noise model. (An additive model for
density estimation would not make much sense.)

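Theorem 5. We make assumption (51). In the white noise model, the
following bound holds for the $\delta$-net minimizer $\hat f$, for $f \in \mathcal{F}$:

$$E(\|\hat f - f\|_2^2) \le 3\delta^2 + 32c^{-1}n^{-1}\Biggl[\sum_{j=1}^p \rho_j^2\lambda_j + \Bigl(\sum_{j=1}^p \rho_j\Bigr)^2\Biggr].$$
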
A proof of Theorem 5 is given in Section 6.6. 

5.2.2. Application to additive models. We now apply Theorem 5 in order
to discuss additive models $f(x) = f_1(x_1) + \cdots + f_d(x_d)$. In $L_2(\mathbb{R}^d)$, we
have $\|f_1 + \cdots + f_d\|_2^2 = \sum_{j=1}^d \|f_j\|_2^2$ if the functions $f_j$ are normed such that
$\int f_j(x_j)\,dx_j = 0$. Thus, (51) holds trivially. Assumption (51) also holds in
other $L_2$-spaces with dominating measure differing from the Lebesgue measure.
A discussion of condition (51) for these classes can be found in, for
example, Mammen, Linton and Nielsen (1999); also, see Bickel et al. (1993).
Such $L_2$-spaces naturally arise in additive regression models. For a white
noise model, they arise if one assumes an additive model for transformed
covariables. We assume that for the models $\mathcal{F}_j$, one can find $\delta_j$-nets $\mathcal{F}_{j,\delta_j}$
such that choosing $\delta_j = \psi_{n,j}$ with

$$\psi_{n,j}^2 \asymp n^{-1}\varrho^2(Q,\mathcal{F}_{j,\psi_{n,j}})\log(\#\mathcal{F}_{j,\psi_{n,j}})$$

gives a rate-optimal $\delta$-net minimizer in the model $\mathcal{F}_j$. Now, $\mathcal{F}_\delta = \mathcal{F}_{1,\delta_1} \oplus
\cdots \oplus \mathcal{F}_{d,\delta_d}$ is a $\delta$-net of $\mathcal{F}$ with $\delta = \sum_{j=1}^d \delta_j$. From Theorem 5, we get that
the $\delta$-net minimizer $\hat f$ over the net $\mathcal{F}_\delta$ achieves the rate $O(\psi_n)$ with $\psi_n =
\max_{1\le j\le d}\psi_{n,j}$. This is just the type of result we called an oracle result at
the beginning of this section.
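
The orthogonality claim made at the start of this subsection can be checked numerically. The sketch below is illustrative only: it assumes the unit square $[0,1]^2$ with Lebesgue measure, two hypothetical centered components, and a midpoint-rule approximation of the norms. It compares $\|f_1 + f_2\|_2^2$ with $\|f_1\|_2^2 + \|f_2\|_2^2$, and also shows that the identity fails once the components are no longer centered.

```python
import math


def sq_norm_2d(f, m=200):
    """Midpoint-rule approximation of the squared L2 norm of f on [0, 1]^2."""
    h = 1.0 / m
    pts = [(i + 0.5) * h for i in range(m)]
    return sum(f(x1, x2) ** 2 for x1 in pts for x2 in pts) * h * h


# Hypothetical centered components: each integrates to zero over [0, 1].
def f1(x1):
    return math.sin(2 * math.pi * x1)


def f2(x2):
    return x2 - 0.5


# Centered case: the cross term vanishes, so (51) holds with c = 1.
lhs = sq_norm_2d(lambda x1, x2: f1(x1) + f2(x2))
rhs = sq_norm_2d(lambda x1, x2: f1(x1)) + sq_norm_2d(lambda x1, x2: f2(x2))
gap = abs(lhs - rhs)

# Uncentered case: shifting both components by 1 leaves a cross term
# 2 * (integral of f1 + 1) * (integral of f2 + 1) = 2.
lhs_shift = sq_norm_2d(lambda x1, x2: (f1(x1) + 1.0) + (f2(x2) + 1.0))
rhs_shift = (sq_norm_2d(lambda x1, x2: f1(x1) + 1.0)
             + sq_norm_2d(lambda x1, x2: f2(x2) + 1.0))
cross_term = lhs_shift - rhs_shift

print(gap, cross_term)
```

The centering condition is what removes the cross term; without it, (51) would have to be verified separately.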

In general, the oracle result does not follow from Theorem 1. The application
of Theorem 1 leads to an assumption of the type

$$\max_{1\le j\le d}\varrho^2(Q,\mathcal{F}_{j,\psi_n}) \times \max_{1\le j\le d}\log(\#\mathcal{F}_{j,\psi_n}) = O(n\psi_n^2),$$

whereas Theorem 5 only requires that

$$\max_{1\le j\le d}\bigl[\varrho^2(Q,\mathcal{F}_{j,\psi_n})\log(\#\mathcal{F}_{j,\psi_n})\bigr] = O(n\psi_n^2).$$

This can make a big difference. First, the entropy numbers of the additive
classes $\mathcal{F}_j$ may differ. Furthermore, the operator $Q$ may act quite differently
on the spaces $\mathcal{F}_j$.

5.2.3. Ellipsoidal spaces and convolution. As an example, we now assume
that the underlying function space is $\mathcal{F} = \mathcal{F}_1 \oplus \cdots \oplus \mathcal{F}_d$, where

$$\mathcal{F}_k = \Bigl\{\sum_{j=0}^\infty \theta_{kj}\phi_{kj} : \theta_{k\cdot} \in \Theta_{s_k,L_k}\Bigr\}$$

for basis functions $\phi_{kj} : [0,1] \to \mathbb{R}$ and the ellipsoids are defined by

$$\Theta_{s_k,L_k} = \Bigl\{\theta : \sum_{j=0}^\infty a_{kj}^2\theta_j^2 \le L_k^2\Bigr\}, \qquad k = 1,\ldots,d, \tag{52}$$

where we assume that there exist positive constants $C_1, C_2$ such that for all
$j \in \{0,1,\ldots\}$,

$$C_1 j^{s_k} \le a_{kj} \le C_2 j^{s_k}. \tag{53}$$

Let $A$ be a convolution operator: $Af = a * f$, where $a : \mathbb{R}^d \to \mathbb{R}$ is a known
function. Then

$$Af = A_1 f_1 + \cdots + A_d f_d,$$

where $f(x) = f_1(x_1) + \cdots + f_d(x_d)$ and

$$A_k f_k(x_k) = \int f_k(x_k - y_k)\,a_k(y_k)\,dy_k,$$

where $a_k(y_k) = \int_{\mathbb{R}^{d-1}} a(y)\prod_{l=1,\,l\ne k}^d dy_l$ is the $k$th marginal function of $a$. We
can decompose $Q$ accordingly:

$$Qg = Q_1 g_1 + \cdots + Q_d g_d.$$

Operators $A_j$ and $Q_j$ are restrictions of $A$ and $Q$ to $\mathcal{F}_j$. We apply the
singular value decomposition for $A_k$. Let

$$\phi_{kj}(t) = \sqrt{2}\cos(2\pi jt), \qquad t \in [0,1],$$

where $j = 1,2,\ldots$ and $\phi_{k0}(t) = I_{[0,1]}(t)$. The collection $(\phi_{kj})$, $j = 0,1,\ldots$, is a
basis for 1-periodic functions on $L_2([0,1])$. When $a_k$ are 1-periodic functions
in $L_2([0,1])$, we can write

$$a_k(x_k) = \sum_{j=0}^\infty b_{kj}\phi_{kj}(x_k).$$

The functions $\phi_{kj}$ are the singular functions of the operator $A_k$ and the
values $b_{kj}$ are the corresponding singular values. We give the rate of
convergence of the $\delta$-net estimator and show that the estimator achieves the
optimal rate of convergence.

Corollary 3. Let $\mathcal{F} = \mathcal{F}_1 \oplus \cdots \oplus \mathcal{F}_d$. We assume that the coefficients
of the ellipsoid satisfy (53). We assume that $a_k$ are 1-periodic functions in
$L_2([0,1])$ and that the Fourier coefficients of $a_k$ satisfy

$$C_2 j^{-q_k} \le b_{kj} \le C_3 j^{-q_k}$$

for some $q_k > 0$, $C_2, C_3 > 0$. Then, in the white noise model,

$$\limsup_{n\to\infty} n^{a}\sup_{f\in\mathcal{F}} E_f\|\hat f - f\|_2 < \infty,$$

where $\hat f$ is the $\delta$-net estimator and

$$a = \min_{k=1,\ldots,d}\frac{s_k}{2s_k + 2q_k + 1}.$$

Also,

$$\liminf_{n\to\infty} n^{a}\inf_{\hat g}\sup_{f\in\mathcal{F}} E_f\|\hat g - f\|_2 > 0,$$

where the infimum is taken over any estimators $\hat g$ in the white noise model.

Proof. For the upper bound, we apply Theorem 5. As in Section 5.1.1,
we can find $\delta$-nets $\mathcal{F}_{k,\delta}$ for $\mathcal{F}_k$ whose cardinality is bounded by $\log(\#\mathcal{F}_{k,\delta}) \le
C\delta^{-1/s_k}$ and $\varrho(Q_k,\mathcal{F}_{k,\delta}) \le C\delta^{-q_k/s_k}$. The upper bound of Theorem 5 gives
as the rate the maximum of the component rates $n^{-s_k/(2s_k+2q_k+1)}$. For the
lower bound, we apply the lower bound of Corollary 1 in the case $d = 1$ and
the fact that one cannot do better in the additive model than in the model
that has only one component. □
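
To make the rate in Corollary 3 concrete, the following sketch computes the component exponents $s_k/(2s_k+2q_k+1)$ and their minimum with exact rational arithmetic. The function names and the numerical values of $s_k$ and $q_k$ are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def component_exponent(s, q):
    """Exponent s / (2s + 2q + 1) of the rate n^(-exponent) for one component."""
    s, q = Fraction(s), Fraction(q)
    return s / (2 * s + 2 * q + 1)


def additive_exponent(smoothness, ill_posedness):
    """Exponent for the additive model: the minimum over the components."""
    return min(component_exponent(s, q)
               for s, q in zip(smoothness, ill_posedness))


# Hypothetical example with d = 3 components.
s_values = [2, 1, 3]
q_values = [0, 1, 2]
exponents = [component_exponent(s, q) for s, q in zip(s_values, q_values)]
overall = additive_exponent(s_values, q_values)

print(exponents, overall)
```

In this example the second component, which is both the least smooth and ill-posed, determines the rate of the whole additive model.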

6. Proofs. 

6.1. A preliminary lemma. We prove that the theoretical error of a min- 
imization estimator may be bounded by the optimal theoretical error and 
an additional stochastic term. 

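Lemma 1. Let $\mathcal{C} \subset L_2(\mathbb{R}^d)$. Let $\hat f$ be such that

$$\gamma_n(\hat f) \le \inf_{g\in\mathcal{C}}\gamma_n(g) + \epsilon, \tag{54}$$

where $\epsilon \ge 0$. Then, for each $f^0 \in \mathcal{C}$,

$$\|\hat f - f\|_2^2 \le \|f^0 - f\|_2^2 + \epsilon + 2\nu_n[Q(\hat f - f^0)],$$

where $f$ is the true density or the true signal function and $\nu_n(g)$ is the
centered empirical operator:

$$\nu_n(g) = \begin{cases} \int g\,dY_n - \int g\,(Af), & \text{white noise model},\\[4pt] n^{-1}\sum_{i=1}^n g(Y_i) - \int g\,(Af), & \text{density estimation}, \end{cases} \tag{55}$$

where $g : \mathbb{R}^d \to \mathbb{R}$.

Proof. We have, for $g = \hat f$ and $g = f^0$,

$$\|g - f\|_2^2 - \gamma_n(g) = \begin{cases} \|f\|_2^2 - 2\int fg + 2\int (Qg)\,dY_n, & \text{white noise model},\\[4pt] \|f\|_2^2 - 2\int fg + 2n^{-1}\sum_{i=1}^n (Qg)(Y_i), & \text{density estimation}. \end{cases}$$

We have $\int fg = \int (Af)(Qg)$. Thus,

$$\|\hat f - f\|_2^2 - \gamma_n(\hat f) + \gamma_n(f^0) - \|f^0 - f\|_2^2 = 2\nu_n[Q(\hat f - f^0)]. \tag{56}$$

Thus,

$$\|\hat f - f\|_2^2 - \|f^0 - f\|_2^2 = \|\hat f - f\|_2^2 - \gamma_n(\hat f) + \gamma_n(\hat f) - \|f^0 - f\|_2^2$$

$$\le \|\hat f - f\|_2^2 - \gamma_n(\hat f) + \gamma_n(f^0) + \epsilon - \|f^0 - f\|_2^2 \tag{57}$$

$$= 2\nu_n[Q(\hat f - f^0)] + \epsilon. \tag{58}$$

In (57), we applied (54) and in (58), we applied (56). □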

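6.2. Proof of Theorem 1. Let $f \in \mathcal{F}$ be the true density. Let $\phi^0 \in \mathcal{F}_\delta$.
Define $\zeta = C_1\|\phi^0 - f\|_2^2 + C_2 n^{-1}\varrho^2(Q,\mathcal{F}_\delta)\log_e(\#\mathcal{F}_\delta)$, where $C_1$ is defined in
(10) and $C_2$ is defined in (11). We have that

$$E\|\hat f - f\|_2^2 = \int_0^\infty P(\|\hat f - f\|_2^2 > t)\,dt \le \zeta + \int_\zeta^\infty P(\|\hat f - f\|_2^2 > t)\,dt$$

$$= \zeta + C_2 n^{-1}\varrho^2(Q,\mathcal{F}_\delta)\int_0^\infty P\bigl(\|\hat f - f\|_2^2 > C_2 n^{-1}\varrho^2(Q,\mathcal{F}_\delta)\,t + \zeta\bigr)\,dt. \tag{59}$$

Define $\tau_n = C_\tau n^{-1}\varrho^2(Q,\mathcal{F}_\delta)(\log_e(\#\mathcal{F}_\delta) + t)$, where $C_\tau$ is defined in (12).
Then

$$P\bigl(\|\hat f - f\|_2^2 > C_2 n^{-1}\varrho^2(Q,\mathcal{F}_\delta)\,t + \zeta\bigr) = P\bigl(\|\hat f - f\|_2^2 > C_1\|\phi^0 - f\|_2^2 + C_2 C_\tau^{-1}\tau_n\bigr)$$

$$= P\bigl((1-2\xi)^{-1}\|\hat f - f\|_2^2 > 2\xi(1-2\xi)^{-1}\|\hat f - f\|_2^2 + C_1\|\phi^0 - f\|_2^2 + C_2 C_\tau^{-1}\tau_n\bigr) \tag{60}$$

$$= P\bigl(\|\hat f - f\|_2^2 > 2\xi\|\hat f - f\|_2^2 + (1+2\xi)\|\phi^0 - f\|_2^2 + \xi\tau_n\bigr).$$

We have, by Lemma 1, that $\|\hat f - f\|_2^2 \le \|\phi^0 - f\|_2^2 + 2\nu_n[Q(\hat f - \phi^0)]$. This
implies that

$$P\bigl(\|\hat f - f\|_2^2 > C_2 n^{-1}\varrho^2(Q,\mathcal{F}_\delta)\,t + \zeta\bigr) \le P\bigl(\nu_n[Q(\hat f - \phi^0)] > \xi\|\hat f - f\|_2^2 + \xi\|\phi^0 - f\|_2^2 + \xi\tau_n/2\bigr)$$

$$= P\bigl(\nu_n[Q(\hat f - \phi^0)] > w(\hat f)\,\xi\bigr) \tag{61}$$

$$\le P\Bigl(\max_{\phi\in\mathcal{F}_\delta,\,\phi\ne\phi^0}\frac{\nu_n[Q(\phi - \phi^0)]}{w(\phi)} > \xi\Bigr) \stackrel{def}{=} P_{\max},$$

where $w(\phi) = \|\phi - f\|_2^2 + \|\phi^0 - f\|_2^2 + \tau_n/2$. We will prove that

$$P_{\max} \le \exp(-t). \tag{62}$$

Together with (59) and (61), this proves the theorem.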

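Proof of (62). Define

$$\mathcal{G} = \Bigl\{\frac{Q(\phi - \phi^0)}{w(\phi)} : \phi \in \mathcal{F}_\delta\Bigr\}.$$

We have that

$$P_{\max} \le \sum_{g\in\mathcal{G}} P(\nu_n(g) > \xi). \tag{63}$$

Also, $w(\phi) \ge \frac{1}{2}(\|\phi - \phi^0\|_2^2 + \tau_n) \ge \|\phi - \phi^0\|_2\,\tau_n^{1/2}$ and thus

$$v_0 \stackrel{def}{=} \max_{g\in\mathcal{G}}\|g\|_2^2 \le \max_{\phi\in\mathcal{F}_\delta,\,\phi\ne\phi^0}\frac{\|Q(\phi - \phi^0)\|_2^2}{\|\phi - \phi^0\|_2^2\,\tau_n} \le \frac{\varrho^2(Q,\mathcal{F}_\delta)}{\tau_n}. \tag{64}$$
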

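Gaussian white noise. When $W \sim N(0,\sigma^2)$, we have $P(W > \xi) \le 2^{-1}
\exp\{-\xi^2/(2\sigma^2)\}$ for $\xi > 0$; see, for example, Dudley (1999), Proposition
2.2.1. We have that $\nu_n(g) \sim N(0, n^{-1}\|g\|_2^2)$. Thus, by (64), for $g \in \mathcal{G}$,

$$P(\nu_n(g) > \xi) \le 2^{-1}\exp\Bigl\{-\frac{n\xi^2}{2\|g\|_2^2}\Bigr\} \le 2^{-1}\exp\Bigl\{-\frac{n\xi^2\tau_n}{2\varrho^2(Q,\mathcal{F}_\delta)}\Bigr\}.$$

Defining $C_\xi = \xi^2 C_\tau/2$, we get that

$$P_{\max} \le \#\mathcal{F}_\delta\cdot\exp\Bigl\{-\frac{n\xi^2\tau_n}{2\varrho^2(Q,\mathcal{F}_\delta)}\Bigr\} = \#\mathcal{F}_\delta\cdot\exp\{-C_\xi[\log_e(\#\mathcal{F}_\delta) + t]\} \le \exp(-t),$$

since $C_\xi \ge 1$, by the choice of $\xi$.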
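
The Gaussian tail bound quoted above can be verified numerically against the exact tail. The sketch below uses only the standard library; the helper names, the grid of evaluation points, and the value of $\sigma$ are illustrative choices, not part of the proof.

```python
import math


def gaussian_upper_tail(x, sigma=1.0):
    """Exact P(W > x) for W ~ N(0, sigma^2), via the complementary error function."""
    return 0.5 * math.erfc(x / (sigma * math.sqrt(2.0)))


def tail_bound(x, sigma=1.0):
    """The bound 2^{-1} exp(-x^2 / (2 sigma^2)) used in the proof."""
    return 0.5 * math.exp(-x * x / (2.0 * sigma * sigma))


# Compare the exact tail with the bound on a grid of positive thresholds.
sigma = 1.5
grid = [0.01 * k for k in range(1, 801)]  # x in (0, 8]
worst_ratio = max(gaussian_upper_tail(x, sigma) / tail_bound(x, sigma)
                  for x in grid)

print(worst_ratio)  # stays below 1 on the whole grid
```

The ratio approaches 1 only as the threshold tends to 0, which is why the factor $2^{-1}$ cannot be improved in general.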

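Density estimation. Define $v = \sup_{g\in\mathcal{G}}\mathrm{Var}_f(g(Y_1))$ and $b = \sup_{g\in\mathcal{G}}\|g\|_\infty$.
We have that

$$v \le \|Af\|_\infty v_0 \le B_\infty\frac{\varrho^2(Q,\mathcal{F}_\delta)}{\tau_n} \tag{65}$$

by (64). Also, $w(\phi) \ge \tau_n/2$ and thus, because of $\varrho(Q,\mathcal{F}_\delta) \ge 1$, we have that

$$b \le 2B'_\infty\cdot\frac{2}{\tau_n} \le 4B'_\infty\frac{\varrho^2(Q,\mathcal{F}_\delta)}{\tau_n}. \tag{66}$$

By applying Bernstein's inequality, we get, with (65) and (66), that

$$P(\nu_n(g) > \xi) \le \exp\Bigl\{-\frac{n\xi^2}{2(v + b\xi/3)}\Bigr\} \le \exp\Bigl\{-\frac{n\xi^2\tau_n}{2\varrho^2(Q,\mathcal{F}_\delta)(B_\infty + 4B'_\infty\xi/3)}\Bigr\}.$$

Continuing from (63),

$$P_{\max} \le \#\mathcal{F}_\delta\cdot\exp\Bigl\{-\frac{n\xi^2\tau_n}{2\varrho^2(Q,\mathcal{F}_\delta)(B_\infty + 4B'_\infty\xi/3)}\Bigr\} = \#\mathcal{F}_\delta\cdot\exp\{-C_\xi[\log_e(\#\mathcal{F}_\delta) + t]\} \le \exp(-t),$$

where

$$C_\xi \stackrel{def}{=} \frac{\xi^2 C_\tau}{2(B_\infty + 4B'_\infty\xi/3)}$$

and $C_\xi \ge 1$, by the choice of $C_\tau$. We have proven (62) and thus the theorem.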
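
Bernstein's inequality, in the form $P(\nu_n(g) > \xi) \le \exp\{-n\xi^2/(2(v + b\xi/3))\}$, can be compared with an exact tail probability in a simple special case. The sketch below assumes a hypothetical setting in which $g(Y_i)$ is a Bernoulli($p$) variable, so that $v = p(1-p)$ and $b = 1$ are valid choices; the sample size and thresholds are arbitrary.

```python
import math


def binomial_upper_tail(n, p, xi):
    """Exact P(sample mean of n Bernoulli(p) draws minus p exceeds xi)."""
    return sum(
        math.comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if k / n - p > xi
    )


def bernstein_bound(n, v, b, xi):
    """Bernstein bound exp(-n xi^2 / (2 (v + b xi / 3)))."""
    return math.exp(-n * xi ** 2 / (2.0 * (v + b * xi / 3.0)))


n, p = 200, 0.3
v, b = p * (1 - p), 1.0  # variance bound and sup-norm bound for this g
checks = [
    (xi, binomial_upper_tail(n, p, xi), bernstein_bound(n, v, b, xi))
    for xi in (0.02, 0.05, 0.1, 0.2)
]

for xi, exact, bound in checks:
    print(xi, exact, bound)
```

The bound is loose for small thresholds and becomes informative as $\xi$ grows, which matches its role in the proof, where it is applied at a fixed level $\xi$ with $v$ and $b$ shrinking in $\tau_n^{-1}$.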

6.3. Proof of Theorem 2. To prove Theorem 2, we follow the approach 
of Hasminskii and Ibragimov (1990) and use Theorem 6 in Tsybakov (1998), 
which gives us the following lemma. For a proof, see Klemela and Mammen 
(2009). 

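Lemma 2. Let $\mathcal{D} \subset \mathcal{F}$ be a finite set for which

$$\min\{\|f - g\|_2 : f, g \in \mathcal{D},\ f \ne g\} \ge \delta, \tag{67}$$

where $\delta > 0$. Assume that for some $f_0 \in \mathcal{D}$ and for all $f \in \mathcal{D}\setminus\{f_0\}$,

$$P^{(n)}_{Af}\Bigl(\frac{dP^{(n)}_{Af_0}}{dP^{(n)}_{Af}} \ge \tau\Bigr) \ge 1 - \alpha \tag{68}$$

for some $0 < \alpha < 1$, $\tau > 0$. Here, $P^{(n)}_{Af}$ is the product measure corresponding
to the density $Af$ in the density estimation model, and in the Gaussian white
noise model, $P^{(n)}_{Af}$ is the measure of the process $Y_n$ in (2). It then holds that

$$\inf_{\hat f}\sup_{f\in\mathcal{D}} E_{Af}\|\hat f - f\|_2^2 \ge \frac{\delta^2}{4}(1-\alpha)\frac{\tau(N_\delta - 1)}{1 + \tau(N_\delta - 1)},$$

where $N_\delta = \#\mathcal{D} \ge 2$. Here, the infimum is taken over all estimators (either
in the density estimation model or in the Gaussian white noise model).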

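Proof of Theorem 2. For $f, f_0 \in \mathcal{D}_{\psi_n}$, $f \ne f_0$,

$$P^{(n)}_{Af}\Bigl(\frac{dP^{(n)}_{Af_0}}{dP^{(n)}_{Af}} < \tau\Bigr) = P^{(n)}_{Af}\Bigl(\log\frac{dP^{(n)}_{Af}}{dP^{(n)}_{Af_0}} > \log\tau^{-1}\Bigr)$$

$$\le (\log\tau^{-1})^{-1} D_K^2\bigl(P^{(n)}_{Af}, P^{(n)}_{Af_0}\bigr) \tag{69}$$

$$= \begin{cases} (\log\tau^{-1})^{-1}\,n\,D_K^2(Af, Af_0), & \text{density estimation},\\[4pt] (\log\tau^{-1})^{-1}\,\dfrac{n}{2}\,\|Af - Af_0\|_2^2, & \text{Gaussian white noise}, \end{cases} \tag{70}$$

where, in (69), we applied Markov's inequality and for the Gaussian white
noise model, in (70), we applied the fact that under $P^{(n)}_{Af}$,

$$\frac{dP^{(n)}_{Af}}{dP^{(n)}_{Af_0}} = \exp\{n^{1/2}\sigma Z + n\sigma^2/2\},$$

where $Z \sim N(0,1)$ and $\sigma = \|Af - Af_0\|_2$. When we choose

$$\tau = \tau_n = \exp\{-\alpha^{-1}n[C_1\varrho_K(A,\mathcal{D}_{\psi_n})\psi_n]^2\}$$

for $0 < \alpha < 1$, we get, by applying assumption (17), that

$$P^{(n)}_{Af}\Bigl(\frac{dP^{(n)}_{Af_0}}{dP^{(n)}_{Af}} < \tau\Bigr) \le (\log\tau^{-1})^{-1}\,n\,\varrho_K^2(A,\mathcal{D}_{\psi_n})\|f - f_0\|_2^2 \le (\log\tau^{-1})^{-1}\,n\,[\varrho_K(A,\mathcal{D}_{\psi_n})C_1\psi_n]^2 = \alpha. \tag{71}$$

By applying Lemma 2, assumption (16) and (71), we get the lower bound

$$\inf_{\hat f}\sup_{f\in\mathcal{D}_{\psi_n}} E\|\hat f - f\|_2^2 \ge \frac{C_0^2\psi_n^2}{4}(1-\alpha)\frac{\tau_n(N_{\psi_n} - 1)}{1 + \tau_n(N_{\psi_n} - 1)}, \tag{72}$$

where $N_{\psi_n} = \#\mathcal{D}_{\psi_n}$. Let $n$ be so large that $\log_e N_{\psi_n} \ge C_2^2\,n\,\varrho_K^2(A,\mathcal{D}_{\psi_n})\psi_n^2$,
where $C_2 > C_1$. This is possible by (18). Then

$$\tau_n N_{\psi_n} = \exp\{\log_e N_{\psi_n} - \alpha^{-1}n[C_1\varrho_K(A,\mathcal{D}_{\psi_n})\psi_n]^2\} \ge \exp\{n\varrho_K^2(A,\mathcal{D}_{\psi_n})\psi_n^2[C_2^2 - \alpha^{-1}C_1^2]\} \to \infty$$

as $n \to \infty$, where we apply (19) and choose $\alpha$ so that $C_2^2 - \alpha^{-1}C_1^2 > 0$, that
is, $(C_1/C_2)^2 < \alpha < 1$. Then $\lim_{n\to\infty}\tau_n(N_{\psi_n} - 1)/[1 + \tau_n(N_{\psi_n} - 1)] = 1$ and
the theorem follows from (72). □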

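6.4. Proof of Theorem 3. Let $\zeta = C_1\epsilon_n + C_2\psi_n^2$, where $C_1 = (1-2\xi)^{-1}$,
$C_2 = 1 - 2\xi$, $0 < \xi < (3-\sqrt{5})/4$. We have that

$$E\|\hat f - f\|_2^2 = \int_0^\infty P(\|\hat f - f\|_2^2 > t)\,dt \le \zeta + C_2\psi_n^2\int_0^\infty P(\|\hat f - f\|_2^2 > C_2\psi_n^2 t + \zeta)\,dt. \tag{73}$$

With $\tau_n = C_\tau\psi_n^2(t + 1)$, $C_\tau = \xi^{-1}(1-2\xi)^2$, this implies that

$$P(\|\hat f - f\|_2^2 > C_2\psi_n^2 t + \zeta) = P(\|\hat f - f\|_2^2 > 2\xi\|\hat f - f\|_2^2 + \xi\tau_n + \epsilon_n). \tag{74}$$

We have, by Lemma 1, choosing $f^0 = f$, that $\|\hat f - f\|_2^2 \le 2\nu_n[Q(\hat f - f)] + \epsilon_n$.
This implies that

$$P(\|\hat f - f\|_2^2 > C_2\psi_n^2 t + \zeta) \le P\Bigl(\sup_{g\in\mathcal{F}}\frac{\nu_n[Q(g - f)]}{w(g)} > \xi\Bigr) \stackrel{def}{=} P_{\sup}, \tag{75}$$

where $w(g) = \|g - f\|_2^2 + \tau_n/2$. We will prove that

$$P_{\sup} \le \exp(-t\log_e 2). \tag{76}$$

Together with (73) and (75), this implies the theorem.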

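Proof of (76). We use the peeling device; see, for example, van de Geer
(2000), page 69. For $j \ge 0$, let $a_0 = \tau_n/2$, $a_j = 2^{2j}a_0$, $b_j = 2^2 a_j$ and define the
following sets of functions: $\mathcal{G}_j = \{g \in \mathcal{F} : a_j \le w(g) < b_j\}$, $\mathcal{F}_j = \{g \in \mathcal{F} : \|g -
f\|_2^2 \le b_j\}$. We have that

$$\mathcal{F} = \{g \in \mathcal{F} : w(g) \ge a_0\} = \bigcup_{j=0}^\infty \mathcal{G}_j.$$

Thus,

$$P_{\sup} \le \sum_{j=0}^\infty P\Bigl(\sup_{g\in\mathcal{F}_j}\nu_n[Q(g - f)] > \xi a_j\Bigr). \tag{77}$$

By assumption 4 of Theorem 3, $\tilde G(\psi_n) = 24\sqrt{2}\,G(\psi_n)$ for sufficiently large $n$,
where $\tilde G$ is defined in (81). Thus, by the choice of $C = \xi^{-1}\cdot 4\cdot 24\sqrt{2}$
in (24), $\xi\psi_n^2/4 \ge n^{-1/2}\tilde G(\psi_n)$. By the choice of $\xi$, we have that $C_\tau > 2$
and thus $a_0 = C_\tau\psi_n^2(1 + t)/2 \ge \psi_n^2$. Since $G(\delta)/\delta^2$ is decreasing, by
assumption 2 of Theorem 3, $\tilde G(\delta)/\delta^2$ is also decreasing and $\xi n^{1/2}/4 \ge \tilde G(\psi_n)/\psi_n^2 \ge
\tilde G(a_0^{1/2})/a_0 \ge \tilde G(b_j^{1/2})/b_j$, that is,

$$\xi a_j = \xi b_j/4 \ge n^{-1/2}\tilde G(b_j^{1/2}). \tag{78}$$

We now apply Lemma 3 stated below, with (78), to get

$$P\Bigl(\sup_{g\in\mathcal{F}_j}\nu_n[Q(g - f)] > \xi a_j\Bigr) \le \exp\Bigl\{-C'\frac{n\xi^2 a_j^2}{c^2 b_j^{1-\alpha}}\Bigr\} \tag{79}$$

$$\le \exp\{-C''(j + 1)\,n\psi_n^{2(1+\alpha)}(1 + t)^{1+\alpha}\}, \tag{80}$$

where $C'' = C'c^{-2}\xi^2 2^{2(\alpha-1)}(C_\tau/2)^{1+\alpha}$. Here, we have used the facts that
$a_j^2/b_j^{1-\alpha} = 2^{2(\alpha-1)}[2^{2j}C_\tau\psi_n^2(1 + t)/2]^{1+\alpha}$ and $2^{2j(\alpha+1)} \ge j + 1$. When $0 \le b \le 1/2$,
we have $\sum_{j=1}^\infty b^j \le 2b$. When $n\psi_n^{2(1+\alpha)} \ge (\log_e 2)/C''$, we have
$\exp\{-C''n\psi_n^{2(1+\alpha)}(1 + t)^{1+\alpha}\} \le 1/2$. Now, we combine (77) and (80) to get the upper
bound

$$P_{\sup} \le 2\exp\{-C''n\psi_n^{2(1+\alpha)}(1 + t)^{1+\alpha}\} \le \exp\{-t\log_e 2\}.$$

We have proven (76). For the proof of Theorem 3, it remains to prove
Lemma 3 below. Lemma 3 gives an exponential tail bound for the Gaussian
white noise model.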
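
The two elementary inequalities that close the peeling argument can be checked numerically. The sketch below is only a sanity check under stated assumptions: it takes the worst case $C''n\psi_n^{2(1+\alpha)} = \log_e 2$ allowed by the condition in the proof, and the helper names and test values are hypothetical.

```python
import math


def geometric_tail(b, terms=200):
    """Partial sum b + b^2 + ... + b^terms of the geometric series."""
    return sum(b ** j for j in range(1, terms + 1))


# sum_{j >= 1} b^j = b / (1 - b) <= 2b whenever 0 <= b <= 1/2.
geometric_ok = all(geometric_tail(b) <= 2 * b + 1e-15
                   for b in (0.0, 0.1, 0.25, 0.5))


def final_step_ok(t, alpha):
    """Check 2 exp(-log(2) (1 + t)^(1 + alpha)) <= exp(-t log(2))."""
    lhs = 2.0 * math.exp(-math.log(2.0) * (1.0 + t) ** (1.0 + alpha))
    rhs = math.exp(-t * math.log(2.0))
    return lhs <= rhs * (1.0 + 1e-12)  # small slack for rounding at t = 0


final_ok = all(final_step_ok(t, alpha)
               for t in (0.0, 0.5, 1.0, 5.0, 20.0)
               for alpha in (0.0, 0.3, 0.9))

print(geometric_ok, final_ok)
```

Equality in the last step occurs at $t = 0$, so the constant 2 in front of the exponential is exactly absorbed by the term $(1+t)^{1+\alpha} \ge 1 + t$.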

Lemma 3. Let $\nu_n$ be the centered empirical operator of a Gaussian white
noise process. Operator $\nu_n$ is defined in (55). Let $\mathcal{G} \subset L_2(\mathbb{R}^d)$ be such that
$\sup_{g\in\mathcal{G}}\|g\|_2 \le R$ and denote by $\mathcal{G}_\delta$ a $\delta$-net of $\mathcal{G}$, $\delta > 0$. Assume that $\delta \mapsto
\varrho(Q,\mathcal{G}_\delta)\sqrt{\log_e(\#\mathcal{G}_\delta)}$ is decreasing on $(0,R]$, where $\varrho(Q,\mathcal{G}_\delta)$ is defined in
(22) and assume that the entropy integral $G(R)$ defined in (23) is finite.
Assume that $\varrho(Q,\mathcal{G}_\delta) = c\delta^{-\alpha}$, where $0 \le \alpha < 1$ and $c > 0$. Then, for all

$$\xi \ge n^{-1/2}\tilde G(R), \qquad \tilde G(R) = \max\bigl\{24\sqrt{2}\,G(R),\ cR^{1-\alpha}\sqrt{\log_e 2/C'}\bigr\} \tag{81}$$

with $C' = 12^{-2}(C'')^{-2}$ and $C'' = (1-\alpha)^{-3/2}\Gamma(3/2)(\log_e 2)^{-3/2}$, we have

$$P\Bigl(\sup_{g\in\mathcal{G}}\nu_n(Qg) > \xi\Bigr) \le \exp\Bigl\{-C'\frac{n\xi^2}{c^2R^{2(1-\alpha)}}\Bigr\}.$$

A proof of Lemma 3 is given in the technical report Klemela and Mammen
(2009). The main argument makes use of the chaining technique. An
analogous lemma in the direct case is, for example, Lemma 3.2 in van de
Geer (2000).

6.5. Proof of Theorem 4. The proof is similar to the proof of Theorem
3 up to step (79). At this step, we apply Lemma 4 stated below to get

P( sup i^nlQig- f)] >Caj 

< exp + 2#gB, exp " 



The first term in the right-hand side is handled similarly as in the proof 
of Theorem 3. For the second term in the right-hand side, we have, for 
sufficiently large n, 

exp|-l .r^^^^ 1 = exp{-n^2(i+a)22,(i t)'-^a^"} 

since aj"" = (2^-5' ao)^"^ < Oq " and Og " > 1 for sufficiently large n, and we let 
C' = ^2ci+'^/[2i+'^12(Sooc2 2^B'^/9)]. The proof is completed similarly to 
the proof of Theorem 3. 

We have used Lemma 4, which gives an exponential bound for the tail 
probability in the case of density estimation. 

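Lemma 4. Let $Y_1,\ldots,Y_n \in \mathbb{R}^d$ be i.i.d. with density $Af$ and let the centered
empirical process $\nu_n$ be defined as in (55). Assume that $\|Af\|_\infty \le B_\infty$.
Let $\mathcal{G} \subset L_2(\mathbb{R}^d)$ be such that $\sup_{g\in\mathcal{G}}\|g\|_2 \le R$. Denote by $\mathcal{G}_\delta$ a $\delta$-bracketing
net of $\mathcal{G}$, $\delta > 0$. Let $\mathcal{G}^L = \{g^L : (g^L, g^U) \in \mathcal{G}_\delta\}$ and $\mathcal{G}^U = \{g^U : (g^L, g^U) \in \mathcal{G}_\delta\}$.
Assume that $\sup_{g\in\mathcal{G}^L\cup\mathcal{G}^U}\|Qg\|_\infty \le B'_\infty$. Assume that $\delta \mapsto \varrho_{den}(Q,\mathcal{G}_\delta)\sqrt{\log_e(\#\mathcal{G}_\delta)}$
is decreasing on $(0,R]$, where $\varrho_{den}(Q,\mathcal{G}_\delta)$ is defined in (28), and assume that
the entropy integral $G(R)$ defined in (29) is finite. Assume that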
QdeniQ^Gs) = cS-", where < a < 1 and c> 0. Put G{R) = B^^id"^ + 96 • 
2-2«)V2max-[24V2G(i2),4x (logg(2))-i(l - a)-3/2r(3/2)ci?^-'^}. Then, for 
all ^, > n~^/'^G(R), we have 

P( sup i^niQg) > (. 



< 4 exp 



+ 2#gRexp 



B^c'^R^-^^ 
1 



12 5ooC2i22(l-a) + 2^5^/9 

where $\nu_n$ is the centered empirical process defined in (55).

A proof of Lemma 4 is given in the technical report Klemela and Mammen
(2009). The proof uses the chaining technique with truncation. The proof
follows the techniques developed in Bass (1985), Ossiander (1987), Birge and
Massart (1993), Proposition 3 and van de Geer (2000), Theorem 8.13.

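6.6. Proof of Theorem 5. We proceed similarly as in the proof of Theorem
1. Choose $f_\delta \in \mathcal{F}_\delta$ such that $\|f - f_\delta\|_2 \le \delta$, where $f$ is the underlying
function in $\mathcal{F}$. Choose $\xi < 1/2$ and put $\zeta = C_1 + C_2$ with $C_1 = (1-2\xi)^{-1}(1 +
2\xi)\|f - f_\delta\|_2^2$, $C_2 = \kappa n^{-1}\sum_{j=1}^p \rho_j^2\lambda_j$ and $\kappa = 4c^{-1}\xi^{-1}(1-2\xi)^{-1}$. We have that

$$E(\|\hat f - f\|_2^2) \le \zeta + \int_0^\infty P(\|\hat f - f\|_2^2 > t + \zeta)\,dt. \tag{82}$$

For the integrand of the second term, we have that

$$P(\|\hat f - f\|_2^2 > t + \zeta) = P\bigl(\|\hat f - f\|_2^2 > 2\xi\|\hat f - f\|_2^2 + (1-2\xi)t + (1-2\xi)\zeta\bigr).$$

We now use Lemma 1. This gives

$$\|\hat f - f\|_2^2 \le \|f - f_\delta\|_2^2 + 2\nu_n(Q(\hat f - f_\delta)).$$

Together with the last equalities this gives

$$P(\|\hat f - f\|_2^2 > t + \zeta) \le P\bigl(\|f - f_\delta\|_2^2 + 2\nu_n(Q(\hat f - f_\delta)) > 2\xi\|\hat f - f\|_2^2 + (1-2\xi)(t + \zeta)\bigr)$$

$$\le P\bigl(\nu_n(Q(\hat f - f_\delta)) > 2^{-1}\xi\|\hat f - f_\delta\|_2^2 + 2^{-1}(1-2\xi)(t + C_2)\bigr).$$

Now, put $w_j = \rho_j/\sum_{l=1}^p \rho_l$ and decompose $f_\delta = f_{\delta,1} + \cdots + f_{\delta,p}$ and $\hat f =
\hat f_1 + \cdots + \hat f_p$ with $f_{\delta,j}, \hat f_j \in \mathcal{F}_{j,\delta}$. Using assumption (51), we get, with $\beta_j =
2^{-1}(1-2\xi)(w_j t + \kappa n^{-1}\rho_j^2\lambda_j)$, that

$$P(\|\hat f - f\|_2^2 > t + \zeta) \le P\Bigl(\sum_{j=1}^p \nu_n(Q(\hat f_j - f_{\delta,j})) > 2^{-1}\xi c\sum_{j=1}^p \|\hat f_j - f_{\delta,j}\|_2^2 + \sum_{j=1}^p \beta_j\Bigr)$$

$$\le \sum_{j=1}^p\sum_{g_j\in\mathcal{F}_{j,\delta}} P\bigl(\nu_n(Q(g_j - f_{\delta,j})) > 2^{-1}\xi c\|g_j - f_{\delta,j}\|_2^2 + \beta_j\bigr).$$

We now use $P(\nu_n(h) > \xi) \le 2^{-1}\exp(-n\xi^2/[2\|h\|_2^2])$; compare this to the
proof of Theorem 1. Together with the elementary inequality $(x + y)^2 \ge 2xy$ for
$x, y \ge 0$ and $\|Q(g_j - f_{\delta,j})\|_2 \le \rho_j\|g_j - f_{\delta,j}\|_2$, this gives

$$P(\|\hat f - f\|_2^2 > t + \zeta) \le \sum_{j=1}^p\sum_{g_j\in\mathcal{F}_{j,\delta}} 2^{-1}\exp\Bigl[-\frac{n\bigl(2^{-1}\xi c\|g_j - f_{\delta,j}\|_2^2 + \beta_j\bigr)^2}{2\|Q(g_j - f_{\delta,j})\|_2^2}\Bigr]$$

$$\le \sum_{j=1}^p \exp(\lambda_j)\,2^{-1}\exp\bigl[-n\xi c\,2^{-1}\beta_j\rho_j^{-2}\bigr] = \sum_{j=1}^p 2^{-1}\exp\bigl[-n\xi c\,4^{-1}(1-2\xi)w_j\rho_j^{-2}t\bigr].$$

By plugging this into (82), we get

$$E(\|\hat f - f\|_2^2) \le \zeta + \sum_{j=1}^p\int_0^\infty 2^{-1}\exp\bigl[-n\xi c\,4^{-1}(1-2\xi)w_j\rho_j^{-2}t\bigr]\,dt \le \zeta + n^{-1}4[\xi c(1-2\xi)]^{-1}\Bigl(\sum_{j=1}^p \rho_j\Bigr)^2.$$

Choosing $\xi = 1/4$ gives the statement of Theorem 5.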
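
The numerical constants 3 and 32 in Theorem 5 can be traced back to the choice of $\xi$. The sketch below assumes $\xi = 1/4$ and evaluates the two factors $(1+2\xi)/(1-2\xi)$ (multiplying $\|f - f_\delta\|_2^2 \le \delta^2$ in $C_1$) and $4/(\xi(1-2\xi))$ (the constant $\kappa c$ from the proof) in exact arithmetic. The function name is hypothetical.

```python
from fractions import Fraction


def theorem5_constants(xi):
    """Return ((1 + 2 xi) / (1 - 2 xi), 4 / (xi (1 - 2 xi))) for 0 < xi < 1/2."""
    xi = Fraction(xi)
    if not 0 < xi < Fraction(1, 2):
        raise ValueError("xi must lie in (0, 1/2)")
    delta_factor = (1 + 2 * xi) / (1 - 2 * xi)  # factor in front of delta^2
    kappa_times_c = 4 / (xi * (1 - 2 * xi))     # kappa * c from the proof
    return delta_factor, kappa_times_c


at_quarter = theorem5_constants(Fraction(1, 4))
print(at_quarter)

# Compare with neighbouring choices of xi: the second constant is larger there.
nearby = [theorem5_constants(Fraction(k, 20))[1] for k in (3, 4, 6, 7)]
print(nearby)
```

The value $\xi = 1/4$ maximizes $\xi(1-2\xi)$ on $(0, 1/2)$ and therefore minimizes the second constant.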